1. 08 Dec, 2020 40 commits
    • Filipe Manana's avatar
      btrfs: skip unnecessary searches for xattrs when logging an inode · f2f121ab
      Filipe Manana authored
      Every time we log an inode we lookup in the fs/subvol tree for xattrs and
      if we have any, log them into the log tree. However it is very common to
      have inodes without any xattrs, so doing the search wastes times, but more
      importantly it adds contention on the fs/subvol tree locks, either making
      the logging code block and wait for tree locks or making the logging code
      making other concurrent operations block and wait.
      
      The most typical use cases where xattrs are used are when capabilities or
      ACLs are defined for an inode, or when SELinux is enabled.
      
      This change makes the logging code detect when an inode does not have
      xattrs and skip the xattrs search the next time the inode is logged,
      unless the inode is evicted and loaded again or a xattr is added to the
      inode. Therefore skipping the search for xattrs on inodes that don't ever
      have xattrs and are fsynced with some frequency.
      
      The following script that calls dbench was used to measure the impact of
      this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
      directly (no intermediary filesystem on the host) and using a non-debug
      kernel (default configuration on Debian distributions):
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
      
        mkfs.btrfs -f -m single -d single $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 200 40
      
        umount $MNT
      
      The results before this change:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5761605     0.172   312.057
       Close        4232452     0.002    10.927
       Rename        243937     1.406   277.344
       Unlink       1163456     0.631   298.402
       Deltree          160    11.581   221.107
       Mkdir             80     0.003     0.005
       Qpathinfo    5221410     0.065   122.309
       Qfileinfo     915432     0.001     3.333
       Qfsinfo       957555     0.003     3.992
       Sfileinfo     469244     0.023    20.494
       Find         2018865     0.448   123.659
       WriteX       2874851     0.049   118.529
       ReadX        9030579     0.004    21.654
       LockX          18754     0.003     4.423
       UnlockX        18754     0.002     0.331
       Flush         403792    10.944   359.494
      
      Throughput 908.444 MB/sec  40 clients  40 procs  max_latency=359.500 ms
      
      The results after this change:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    6442521     0.159   230.693
       Close        4732357     0.002    10.972
       Rename        272809     1.293   227.398
       Unlink       1301059     0.563   218.500
       Deltree          160     7.796    54.887
       Mkdir             80     0.008     0.478
       Qpathinfo    5839452     0.047   124.330
       Qfileinfo    1023199     0.001     4.996
       Qfsinfo      1070760     0.003     5.709
       Sfileinfo     524790     0.033    21.765
       Find         2257658     0.314   125.611
       WriteX       3211520     0.040   232.135
       ReadX        10098969     0.004    25.340
       LockX          20974     0.003     1.569
       UnlockX        20974     0.002     3.475
       Flush         451553    10.287   331.037
      
      Throughput 1011.77 MB/sec  40 clients  40 procs  max_latency=331.045 ms
      
      +10.8% throughput, -8.2% max latency
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f2f121ab
    • Nikolay Borisov's avatar
      btrfs: merge __set_extent_bit and set_extent_bit · 1cab5e72
      Nikolay Borisov authored
      There are only 2 direct calls to set_extent_bit outside of extent-io -
      in btrfs_find_new_delalloc_bytes and btrfs_truncate_block, the rest are
      thin wrappers around __set_extent_bit. This adds unnecessary indirection
      and just makes it more annoying when looking at the various extent bit
      manipulation functions.  This patch renames __set_extent_bit to
      set_extent_bit effectively removing a level of indirection. No
      functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ reformat and remove __must_check ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1cab5e72
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Josef Bacik's avatar
      btrfs: remove extent_buffer::recursed · a55463c9
      Josef Bacik authored
      It is unused everywhere now, it can be removed.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a55463c9
    • Josef Bacik's avatar
      btrfs: remove the recurse parameter from __btrfs_tree_read_lock · 0ecae6ff
      Josef Bacik authored
      It is completely unused now, remove it.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ecae6ff
    • Josef Bacik's avatar
      btrfs: use btrfs_tree_read_lock in btrfs_search_slot · fe596ca3
      Josef Bacik authored
      We no longer use recursion, so
      __btrfs_tree_read_lock(BTRFS_NESTING_NORMAL) == btrfs_tree_read_lock.
      Replace this call with the simple helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fe596ca3
    • Josef Bacik's avatar
      btrfs: merge back btrfs_read_lock_root_node helpers · 1bb96598
      Josef Bacik authored
      We no longer have recursive locking and there's no need for separate
      helpers that allowed the transition to rwsem with minimal code changes.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1bb96598
    • Josef Bacik's avatar
      btrfs: locking: remove the recursion handling code · 4048daed
      Josef Bacik authored
      Now that we're no longer using recursion, rip out all of the supporting
      code.  Follow up patches will clean up the callers of these functions.
      
      The extent_buffer::lock_owner is still retained as it allows safety
      checks in btrfs_init_new_buffer for the case that the free space cache
      is corrupted and we try to allocate a block that we are currently using
      and have locked in the path.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4048daed
    • Josef Bacik's avatar
      btrfs: remove btrfs_path::recurse · 2f5239dc
      Josef Bacik authored
      With my async free space cache loading patches ("btrfs: load free space
      cache asynchronously") we no longer have a user of path->recurse and can
      remove it.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2f5239dc
    • Josef Bacik's avatar
      btrfs: unlock to current level in btrfs_next_old_leaf · 0e46318d
      Josef Bacik authored
      Filipe reported the following lockdep splat
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.10.0-rc2-btrfs-next-71 #1 Not tainted
        ------------------------------------------------------
        find/324157 is trying to acquire lock:
        ffff8ebc48d293a0 (btrfs-tree-01#2/3){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
      
        but task is already holding lock:
        ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (btrfs-tree-00){++++}-{3:3}:
      	 lock_acquire+0xd8/0x490
      	 down_write_nested+0x44/0x120
      	 __btrfs_tree_lock+0x27/0x120 [btrfs]
      	 btrfs_search_slot+0x2a3/0xc50 [btrfs]
      	 btrfs_insert_empty_items+0x58/0xa0 [btrfs]
      	 insert_with_overflow+0x44/0x110 [btrfs]
      	 btrfs_insert_xattr_item+0xb8/0x1d0 [btrfs]
      	 btrfs_setxattr+0xd6/0x4c0 [btrfs]
      	 btrfs_setxattr_trans+0x68/0x100 [btrfs]
      	 __vfs_setxattr+0x66/0x80
      	 __vfs_setxattr_noperm+0x70/0x200
      	 vfs_setxattr+0x6b/0x120
      	 setxattr+0x125/0x240
      	 path_setxattr+0xba/0xd0
      	 __x64_sys_setxattr+0x27/0x30
      	 do_syscall_64+0x33/0x80
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (btrfs-tree-01#2/3){++++}-{3:3}:
      	 check_prev_add+0x91/0xc60
      	 __lock_acquire+0x1689/0x3130
      	 lock_acquire+0xd8/0x490
      	 down_read_nested+0x45/0x220
      	 __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
      	 btrfs_next_old_leaf+0x27d/0x580 [btrfs]
      	 btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
      	 iterate_dir+0x170/0x1c0
      	 __x64_sys_getdents64+0x83/0x140
      	 do_syscall_64+0x33/0x80
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-tree-00);
      				 lock(btrfs-tree-01#2/3);
      				 lock(btrfs-tree-00);
          lock(btrfs-tree-01#2/3);
      
         *** DEADLOCK ***
      
        5 locks held by find/324157:
         #0: ffff8ebc502c6e00 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
         #1: ffff8eb97f689980 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: iterate_dir+0x52/0x1c0
         #2: ffff8ebaec00ca58 (btrfs-tree-02#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         #3: ffff8eb98f986f78 (btrfs-tree-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         #4: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
      
        stack backtrace:
        CPU: 2 PID: 324157 Comm: find Not tainted 5.10.0-rc2-btrfs-next-71 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         check_noncircular+0xff/0x110
         ? mark_lock.part.0+0x468/0xe90
         check_prev_add+0x91/0xc60
         __lock_acquire+0x1689/0x3130
         ? kvm_clock_read+0x14/0x30
         ? kvm_sched_clock_read+0x5/0x10
         lock_acquire+0xd8/0x490
         ? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         down_read_nested+0x45/0x220
         ? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         btrfs_next_old_leaf+0x27d/0x580 [btrfs]
         btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
         iterate_dir+0x170/0x1c0
         __x64_sys_getdents64+0x83/0x140
         ? filldir+0x1d0/0x1d0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens because btrfs_next_old_leaf searches down to our current
      key, and then walks up the path until we can move to the next slot, and
      then reads back down the path so we get the next leaf.
      
      However it doesn't unlock any lower levels until it replaces them with
      the new extent buffer.  This is technically fine, but of course causes
      lockdep to complain, because we could be holding locks on lower levels
      while locking upper levels.
      
      Fix this by dropping all nodes below the level that we use as our new
      starting point before we start reading back down the path.  This also
      allows us to drop the nested/recursive locking magic, because we're no
      longer locking two nodes at the same level anymore.
      Reported-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0e46318d
    • Josef Bacik's avatar
      btrfs: cleanup the locking in btrfs_next_old_leaf · ffeb03cf
      Josef Bacik authored
      We are carrying around this next_rw_lock from when we would do spinning
      vs blocking read locks.  Now that we have the rwsem locking we can
      simply use the read lock flag unconditionally and the read lock helpers.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffeb03cf
    • Anand Jain's avatar
      btrfs: remove unused argument seed from btrfs_find_device · b2598edf
      Anand Jain authored
      Commit 343694eee8d8 ("btrfs: switch seed device to list api"), missed to
      check if the parameter seed is true in the function btrfs_find_device().
      This tells it whether to traverse the seed device list or not.
      
      After this commit, the argument is unused and can be removed.
      
      In device_list_add() it's not necessary because fs_devices always points
      to the device's fs_devices. So with the devid+uuid matching, it will
      find the right device and return, thus not needing to traverse seed
      devices.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b2598edf
    • Anand Jain's avatar
      btrfs: drop never met disk total bytes check in verify_one_dev_extent · 3a160a93
      Anand Jain authored
      Drop the condition in verify_one_dev_extent,
      btrfs_device::disk_total_bytes is set even for a seed device. The
      comment is wrong, the size is properly set when cloning the device.
      
      Commit 1b3922a8 ("btrfs: Use real device structure to verify
      dev extent") introduced it but it's unclear why the total_disk_bytes
      was 0.
      
      Theoretically, all devices (including missing and seed) marked with the
      BTRFS_DEV_STATE_IN_FS_METADATA flag gets the total_disk_bytes updated at
      fill_device_from_item():
      
        open_ctree()
          btrfs_read_chunk_tree()
            read_one_dev()
              open_seed_device()
              fill_device_from_item()
      
      Even if verify_one_dev_extent() reports total_disk_bytes == 0, then its
      a bug to be fixed somewhere else and not in verify_one_dev_extent() as
      it's just a messenger. It is never expected that a total_disk_bytes
      shall be zero.
      
      The function fill_device_from_item() does the job of reading it from the
      item and updating btrfs_device::disk_total_bytes. So both the missing
      device and the seed devices do have their disk_total_bytes updated.
      btrfs_find_device can also return a device from fs_info->seed_list
      because it searches it as well.
      
      Furthermore, while removing the device if there is a power loss, we
      could have a device with its total_bytes = 0, that's still valid.
      
      Instead, introduce a check against maximum block device size in
      read_one_dev().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3a160a93
    • Anand Jain's avatar
      btrfs: drop unused argument step from btrfs_free_extra_devids · bacce86a
      Anand Jain authored
      Commit cf89af14 ("btrfs: dev-replace: fail mount if we don't have
      replace item with target device") dropped the multi stage operation of
      btrfs_free_extra_devids() that does not need to check replace target
      anymore and we can remove the 'step' argument.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bacce86a
    • Filipe Manana's avatar
      btrfs: update the number of bytes used by an inode atomically · 2766ff61
      Filipe Manana authored
      There are several occasions where we do not update the inode's number of
      used bytes atomically, resulting in a concurrent stat(2) syscall to report
      a value of used blocks that does not correspond to a valid value, that is,
      a value that does not match neither what we had before the operation nor
      what we get after the operation completes.
      
      In extreme cases it can result in stat(2) reporting zero used blocks, which
      can cause problems for some userspace tools where they can consider a file
      with a non-zero size and zero used blocks as completely sparse and skip
      reading data, as reported/discussed a long time ago in some threads like
      the following:
      
        https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
      
      The cases where this can happen are the following:
      
      -> Case 1
      
      If we do a write (buffered or direct IO) against a file region for which
      there is already an allocated extent (or multiple extents), then we have a
      short time window where we can report a number of used blocks to stat(2)
      that does not take into account the file region being overwritten. This
      short time window happens when completing the ordered extent(s).
      
      This happens because when we drop the extents in the write range we
      decrement the inode's number of bytes and later on when we insert the new
      extent(s) we increment the number of bytes in the inode, resulting in a
      short time window where a stat(2) syscall can get an incorrect number of
      used blocks.
      
      If we do writes that overwrite an entire file, then we have a short time
      window where we report 0 used blocks to stat(2).
      
      Example reproducer:
      
        $ cat reproducer-1.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        stat_loop()
        {
            trap "wait; exit" SIGTERM
            local filepath=$1
            local expected=$2
            local got
      
            while :; do
                got=$(stat -c %b $filepath)
                if [ $got -ne $expected ]; then
                   echo -n "ERROR: unexpected used blocks"
                   echo " (got: $got expected: $expected)"
                fi
            done
        }
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        # mkfs.reiserfs -f $DEV > /dev/null
        mount $DEV $MNT
      
        xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
        expected=$(stat -c %b $MNT/foobar)
      
        # Create a process to keep calling stat(2) on the file and see if the
        # reported number of blocks used (disk space used) changes, it should
        # not because we are not increasing the file size nor punching holes.
        stat_loop $MNT/foobar $expected &
        loop_pid=$!
      
        for ((i = 0; i < 50000; i++)); do
            xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
        done
      
        kill $loop_pid &> /dev/null
        wait
      
        umount $DEV
      
        $ ./reproducer-1.sh
        ERROR: unexpected used blocks (got: 0 expected: 128)
        ERROR: unexpected used blocks (got: 0 expected: 128)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      -> Case 2
      
      If we do a buffered write against a file region that does not have any
      allocated extents, like a hole or beyond EOF, then during ordered extent
      completion we have a short time window where a concurrent stat(2) syscall
      can report a number of used blocks that does not correspond to the value
      before or after the write operation, a value that is actually larger than
      the value after the write completes.
      
      This happens because once we start a buffered write into an unallocated
      file range we increment the inode's 'new_delalloc_bytes', to make sure
      any stat(2) call gets a correct used blocks value before delalloc is
      flushed and completes. However at ordered extent completion, after we
      inserted the new extent, we increment the inode's number of bytes used
      with the size of the new extent, and only later, when clearing the range
      in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
      counter with the size of the extent. So this results in a short time
      window where a concurrent stat(2) syscall can report a number of used
      blocks that accounts for the new extent twice.
      
      Example reproducer:
      
        $ cat reproducer-2.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        stat_loop()
        {
            trap "wait; exit" SIGTERM
            local filepath=$1
            local expected=$2
            local got
      
            while :; do
                got=$(stat -c %b $filepath)
                if [ $got -ne $expected ]; then
                    echo -n "ERROR: unexpected used blocks"
                    echo " (got: $got expected: $expected)"
                fi
            done
        }
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        # mkfs.reiserfs -f $DEV > /dev/null
        mount $DEV $MNT
      
        touch $MNT/foobar
        write_size=$((64 * 1024))
        for ((i = 0; i < 16384; i++)); do
           offset=$(($i * $write_size))
           xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
           blocks_used=$(stat -c %b $MNT/foobar)
      
           # Fsync the file to trigger writeback and keep calling stat(2) on it
           # to see if the number of blocks used changes.
           stat_loop $MNT/foobar $blocks_used &
           loop_pid=$!
           xfs_io -c "fsync" $MNT/foobar
      
           kill $loop_pid &> /dev/null
           wait $loop_pid
        done
      
        umount $DEV
      
        $ ./reproducer-2.sh
        ERROR: unexpected used blocks (got: 265472 expected: 265344)
        ERROR: unexpected used blocks (got: 284032 expected: 283904)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      -> Case 3
      
      Another case where such problems happen is during other operations that
      replace extents in a file range with other extents. Those operations are
      extent cloning, deduplication and fallocate's zero range operation.
      
      The cause of the problem is similar to the first case. When we drop the
      extents from a range, we decrement the inode's number of bytes, and later
      on, after inserting the new extents we increment it. Since this is not
      done atomically, a concurrent stat(2) call can see and return a number of
      used blocks that is smaller than it should be, does not match the number
      of used blocks before or after the clone/deduplication/zero operation.
      
      Like for the first case, when doing a clone, deduplication or zero range
      operation against an entire file, we end up having a time window where we
      can report 0 used blocks to a stat(2) call.
      
      Example reproducer:
      
        $ cat reproducer-3.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f -m reflink=1 $DEV > /dev/null
        mount $DEV $MNT
      
        extent_size=$((64 * 1024))
        num_extents=16384
        file_size=$(($extent_size * $num_extents))
      
        # File foo has many small extents.
        xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
            > /dev/null
        # File bar has much less extents and has exactly the same data as foo.
        xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
      
        expected=$(stat -c %b $MNT/foo)
      
        # Now deduplicate bar into foo. While the deduplication is in progres,
        # the number of used blocks/file size reported by stat should not change
        xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null  &
        dedupe_pid=$!
        while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
            used=$(stat -c %b $MNT/foo)
            if [ $used -ne $expected ]; then
                echo "Unexpected blocks used: $used (expected: $expected)"
            fi
        done
      
        umount $DEV
      
        $ ./reproducer-3.sh
        Unexpected blocks used: 2076800 (expected: 2097152)
        Unexpected blocks used: 2097024 (expected: 2097152)
        Unexpected blocks used: 2079872 (expected: 2097152)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      So fix this by:
      
      1) Making btrfs_drop_extents() not decrement the VFS inode's number of
         bytes, and instead return the number of bytes;
      
      2) Making any code that drops extents and adds new extents update the
         inode's number of bytes atomically, while holding the btrfs inode's
         spinlock, which is also used by the stat(2) callback to get the inode's
         number of bytes;
      
      3) For ranges in the inode's iotree that are marked as 'delalloc new',
         corresponding to previously unallocated ranges, increment the inode's
         number of bytes when clearing the 'delalloc new' bit from the range,
         in the same critical section that decrements the inode's
         'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
      
      An alternative would be to have btrfs_getattr() wait for any IO (ordered
      extents in progress) and locking the whole range (0 to (u64)-1) while it
      it computes the number of blocks used. But that would mean blocking
      stat(2), which is a very used syscall and expected to be fast, waiting
      for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2766ff61
    • Filipe Manana's avatar
      btrfs: fix race when defragmenting leads to unnecessary IO · 7f458a38
      Filipe Manana authored
      When defragmenting we skip ranges that have holes or inline extents, so that
      we don't do unnecessary IO and waste space. We do this check when calling
      should_defrag_range() at btrfs_defrag_file(). However we do it without
      holding the inode's lock. The reason we do it like this is to avoid
      blocking other tasks for too long, that possibly want to operate on other
      file ranges, since after the call to should_defrag_range() and before
      locking the inode, we trigger a synchronous page cache readahead. However
      before we were able to lock the inode, some other task might have punched
      a hole in our range, or we may now have an inline extent there, in which
      case we should not set the range for defrag anymore since that would cause
      unnecessary IO and make us waste space (i.e. allocating extents to contain
      zeros for a hole).
      
      So after we locked the inode and the range in the iotree, check again if
      we have holes or an inline extent, and if we do, just skip the range.
      
      I hit this while testing my next patch that fixes races when updating an
      inode's number of bytes (subject "btrfs: update the number of bytes used
      by an inode atomically"), and it depends on this change in order to work
      correctly. Alternatively I could rework that other patch to detect holes
      and flag their range with the 'new delalloc' bit, but this itself fixes
      an efficiency problem due a race that from a functional point of view is
      not harmful (it could be triggered with btrfs/062 from fstests).
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f458a38
    • Filipe Manana's avatar
      btrfs: refactor btrfs_drop_extents() to make it easier to extend · 5893dfb9
      Filipe Manana authored
      There are many arguments for __btrfs_drop_extents() and its wrapper
      btrfs_drop_extents(), which makes it hard to add more arguments to it and
      requires changing every caller. I have added a couple myself back in 2014
      commit 1acae57b ("Btrfs: faster file extent item replace operations")
      and therefore know firsthand that it is a bit cumbersome to add additional
      arguments to these functions.
      
      Since I will need to add more arguments in a subsequent bug fix, this
      change is preparatory work and adds a data structure that holds all the
      arguments, for both input and output, that are passed to this function,
      with some comments in the structure's definition mentioning what each
      field is and how it relates to other fields.
      
      Callers of this function need only to zero out the content of the
      structure and setup only the fields they need. This also removes the
      need to have both __btrfs_drop_extents() and btrfs_drop_extents(), so
      now we have a single function named btrfs_drop_extents() that takes a
      pointer to this new data structure (struct btrfs_drop_extents_args).
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5893dfb9
    • Josef Bacik's avatar
      btrfs: set the lockdep class for extent buffers on creation · e114c545
      Josef Bacik authored
      Both Filipe and Fedora QA recently hit the following lockdep splat:
      
        WARNING: possible recursive locking detected
        5.10.0-0.rc1.20201028gited8780e3.57.fc34.x86_64 #1 Not tainted
        --------------------------------------------
        rsync/2610 is trying to acquire lock:
        ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        but task is already holding lock:
        ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      	 CPU0
      	 ----
          lock(&eb->lock);
          lock(&eb->lock);
      
         *** DEADLOCK ***
         May be due to missing lock nesting notation
        2 locks held by rsync/2610:
         #0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190
         #1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        stack backtrace:
        CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3.57.fc34.x86_64 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack+0x8b/0xb0
         __lock_acquire.cold+0x12d/0x2a4
         ? kvm_sched_clock_read+0x14/0x30
         ? sched_clock+0x5/0x10
         lock_acquire+0xc8/0x400
         ? btrfs_tree_read_lock_atomic+0x34/0x140
         ? read_block_for_search.isra.0+0xdd/0x320
         _raw_read_lock+0x3d/0xa0
         ? btrfs_tree_read_lock_atomic+0x34/0x140
         btrfs_tree_read_lock_atomic+0x34/0x140
         btrfs_search_slot+0x616/0x9a0
         btrfs_lookup_dir_item+0x6c/0xb0
         btrfs_lookup_dentry+0xa8/0x520
         ? lockdep_init_map_waits+0x4c/0x210
         btrfs_lookup+0xe/0x30
         __lookup_slow+0x10f/0x1e0
         walk_component+0x11b/0x190
         path_lookupat+0x72/0x1c0
         filename_lookup+0x97/0x180
         ? strncpy_from_user+0x96/0x1e0
         ? getname_flags.part.0+0x45/0x1a0
         vfs_statx+0x64/0x100
         ? lockdep_hardirqs_on_prepare+0xff/0x180
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         __do_sys_newlstat+0x26/0x40
         ? lockdep_hardirqs_on_prepare+0xff/0x180
         ? syscall_enter_from_user_mode+0x27/0x80
         ? syscall_enter_from_user_mode+0x27/0x80
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      I have also seen a report of lockdep complaining about the lock class
      that was looked up being the same as the lock class on the lock we were
      using, but I can't find the report.
      
      These are problems that occur because we do not have the lockdep class
      set on the extent buffer until _after_ we read the eb in properly.  This
      is problematic for concurrent readers, because we will create the extent
      buffer, lock it, and then attempt to read the extent buffer.
      
      If a second thread comes in and tries to do a search down the same path
      they'll get the above lockdep splat because the class isn't set properly
      on the extent buffer.
      
      There was a good reason for this, we generally didn't know the real
      owner of the eb until we read it, specifically in refcounted roots.
      
      However now all refcounted roots have the same class name, so we no
      longer need to worry about this.  For non-refcounted trees we know
      which root we're on based on the parent.
      
      Fix this by setting the lockdep class on the eb at creation time instead
      of read time.  This will fix the splat and the weirdness where the class
      changes in the middle of locking the block.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e114c545
    • Josef Bacik's avatar
      btrfs: pass the owner_root and level to alloc_extent_buffer · 3fbaf258
      Josef Bacik authored
      Now that we've plumbed all of the callers to have the owner root and the
      level, plumb it down into alloc_extent_buffer().
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3fbaf258
    • Josef Bacik's avatar
      btrfs: pass the root owner and level around for readahead · 5d81230b
      Josef Bacik authored
      The readahead infrastructure does raw reads of extent buffers, but we're
      going to need to know their owner and level in order to set the lockdep
      key properly, so plumb in the infrastructure that we'll need to have
      this information when we start allocating extent buffers.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5d81230b
    • Josef Bacik's avatar
      btrfs: pass root owner to read_tree_block · 1b7ec85e
      Josef Bacik authored
      In order to properly set the lockdep class of a newly allocated block we
      need to know the owner of the block.  For non-refcounted trees this is
      straightforward, we always know in advance what tree we're reading from.
      For refcounted trees we don't necessarily know, however all refcounted
      trees share the same lockdep class name, tree-<level>.
      
      Fix all the callers of read_tree_block() to pass in the root objectid
      we're using.  In places like relocation and backref we could probably
      unconditionally use 0, but just in case use the root when we have it,
      otherwise use 0 in the cases we don't have the root as it's going to be
      a refcounted tree anyway.
      
      This is a preparation patch for further changes.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1b7ec85e
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in btrfs_qgroup_trace_subtree · 182c79fc
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      182c79fc
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in qgroup_trace_new_subtree_blocks · 3acfbd6a
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3acfbd6a
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in qgroup_trace_extent_swap · 6b2cb7cb
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b2cb7cb
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in walk_down_tree · c990ada2
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c990ada2
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in replace_path · 6b3426be
      Josef Bacik authored
      We're open-coding btrfs_read_node_slot() here, replace with the helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b3426be
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in do_relocation · c9752536
      Josef Bacik authored
      We're open coding btrfs_read_node_slot in do_relocation, replace this
      with the proper helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c9752536
    • Josef Bacik's avatar
      btrfs: use btrfs_read_node_slot in walk_down_reloc_tree · 8ef385bb
      Josef Bacik authored
      We do not need to call read_tree_block() here, simply use the
      btrfs_read_node_slot helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8ef385bb