1. 08 Dec, 2020 40 commits
    • Qu Wenruo's avatar
      btrfs: use nodesize to determine if we need readahead in btrfs_lookup_bio_sums · 35478d05
      Qu Wenruo authored
      In btrfs_lookup_bio_sums() if the bio is pretty large, we want to
      start readahead in the csum tree.
      
      However the threshold is an immediate number, (PAGE_SIZE * 8), from the
      initial btrfs merge.
      
      The meaning of the value is pretty hard to guess, especially when the
      immediate number is from the times when 4K sectorsize was the default
      and only CRC32C was supported.
      
      For the most common btrfs setup, CRC32 csum and 4K sectorsize,
      it means just 32K read would kick readahead, while the csum itself is
      only 32 bytes in size.
      
      Now let's be more reasonable by taking both csum size and node size into
      consideration.
      
      If the csum size for the bio is larger than one leaf, then we kick the
      readahead.  This means for current default btrfs, the threshold will be
      16M.
      
      This change should not change performance observably, thus this is
      mostly a readability enhancement.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      35478d05
    • Qu Wenruo's avatar
      btrfs: only clear EXTENT_LOCK bit in extent_invalidatepage · 829ddec9
      Qu Wenruo authored
      extent_invalidatepage() will try to clear all possible bits since it's
      calling clear_extent_bit() with delete == 1.
      
      This is currently fine, since for btree io tree, it only utilizes
      EXTENT_LOCK bit.  But this could be a problem for later subpage support,
      which will utilize extra io tree bit to represent additional info.
      
      This patch will just convert that clear_extent_bit() to
      unlock_extent_cached().
      
      For current code since only EXTENT_LOCKED bit is utilized, this doesn't
      change the behavior, but provides a much cleaner basis for incoming
      subpage support.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      829ddec9
    • Qu Wenruo's avatar
      btrfs: remove unused parameter phy_offset from btrfs_validate_metadata_buffer · 8e1dc982
      Qu Wenruo authored
      Parameter @phy_offset is the offset against the bio->bi_iter.bi_sector.
      @phy_offset is mostly for data io to lookup the csum in btrfs_io_bio.
      
      But for metadata, it's completely useless as metadata stores their own
      csum in its header, so we can remove it.
      
      Note: parameters @start and @end, they are not utilized at all for
      current sectorsize == PAGE_SIZE case, as we can grab eb directly from
      page.
      
      But those two parameters are very important for later subpage support,
      thus @start/@len are not touched here.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8e1dc982
    • Qu Wenruo's avatar
      btrfs: scrub: remove the anonymous structure from scrub_page · 2c363954
      Qu Wenruo authored
      That anonymous structure serve no special purpose, just replace it with
      regular members.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2c363954
    • Qu Wenruo's avatar
      btrfs: use fixed width int type for extent_state::state · f97e27e9
      Qu Wenruo authored
      Currently the type is unsigned int which could change its width
      depending on the architecture. We need up to 32 bits so make it
      explicit.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f97e27e9
    • Qu Wenruo's avatar
      btrfs: introduce helper to handle page status update in end_bio_extent_readpage() · e09caaf9
      Qu Wenruo authored
      Introduce a new helper to handle update page status in
      end_bio_extent_readpage(). This will be later used for subpage support
      where the page status update can be more complex than now.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e09caaf9
    • Qu Wenruo's avatar
      btrfs: add structure to keep track of extent range in end_bio_extent_readpage · 94e8c95c
      Qu Wenruo authored
      In end_bio_extent_readpage() we had a strange dance around
      extent_start/extent_len.
      
      Hidden behind the strange dance is, it's just calling
      endio_readpage_release_extent() on each bvec range.
      
      Here is an example to explain the original work flow:
      
        Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
      
        end_bio_extent_extent_readpage() entered
        |- extent_start = 0;
        |- extent_end = 0;
        |- bio_for_each_segment_all() {
        |  |- /* Got the 1st bvec */
        |  |- start = SZ_1M;
        |  |- end = SZ_1M + SZ_4K - 1;
        |  |- update = 1;
        |  |- if (extent_len == 0) {
        |  |  |- extent_start = start; /* SZ_1M */
        |  |  |- extent_len = end + 1 - start; /* SZ_1M */
        |  |  }
        |  |
        |  |- /* Got the 2nd bvec */
        |  |- start = SZ_1M + 4K;
        |  |- end = SZ_1M + 4K - 1;
        |  |- update = 1;
        |  |- if (extent_start + extent_len == start) {
        |  |  |- extent_len += end + 1 - start; /* SZ_8K */
        |  |  }
        |  } /* All bio vec iterated */
        |
        |- if (extent_len) {
           |- endio_readpage_release_extent(tree, extent_start, extent_len,
      				      update);
      	/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
      
      As the above flow shows, the existing code in end_bio_extent_readpage()
      is accumulates extent_start/extent_len, and when the contiguous range
      stops, calls endio_readpage_release_extent() for the range.
      
      However current behavior has something not really considered:
      
      - The inode can change
        For bio, its pages don't need to have contiguous page_offset.
        This means, even pages from different inodes can be packed into one
        bio.
      
      - bvec cross page boundary
        There is a feature called multi-page bvec, where bvec->bv_len can go
        beyond bvec->bv_page boundary.
      
      - Poor readability
      
      This patch will address the problem:
      
      - Introduce a proper structure, processed_extent, to record processed
        extent range
      
      - Integrate inode/start/end/uptodate check into
        endio_readpage_release_extent()
      
      - Add more comment on each step.
        This should greatly improve the readability, now in
        end_bio_extent_readpage() there are only two
        endio_readpage_release_extent() calls.
      
      - Add inode check for contiguity
        Now we also ensure the inode is the same one before checking if the
        range is contiguous.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      94e8c95c
    • Qu Wenruo's avatar
      btrfs: tests: remove invalid extent-io test · b1d51f67
      Qu Wenruo authored
      In extent-io-test, there are two invalid tests:
      
      - Invalid nodesize for test_eb_bitmaps()
        Instead of the sectorsize and nodesize combination passed in, we're
        always using hand-crafted nodesize, e.g:
      
      	len = (sectorsize < BTRFS_MAX_METADATA_BLOCKSIZE)
      		? sectorsize * 4 : sectorsize;
      
        In above case, if we have 32K page size, then we will get a length of
        128K, which is beyond max node size, and obviously invalid.
      
        The common page size goes up to 64K so we haven't hit that
      
      - Invalid extent buffer bytenr
        For 64K page size, the only combination we're going to test is
        sectorsize = nodesize = 64K.
        However, in that case we will try to test an eb which bytenr is not
        sectorsize aligned:
      
      	/* Do it over again with an extent buffer which isn't page-aligned. */
      	eb = __alloc_dummy_extent_buffer(fs_info, nodesize / 2, len);
      
        Sector alignment is a hard requirement for any sector size.
        The only exception is superblock. But anything else should follow
        sector size alignment.
      
        This is definitely an invalid test case.
      
      This patch will fix both problems by:
      
      - Honor the sectorsize/nodesize combination
        Now we won't bother to hand-craft the length and use it as nodesize.
      
      - Use sectorsize as the 2nd run extent buffer start
        This would test the case where extent buffer is aligned to sectorsize
        but not always aligned to nodesize.
      
      Please note that, later subpage related cleanup will reduce
      extent_buffer::pages[] to exactly what we need, making the sector
      unaligned extent buffer operations cause problems.
      
      Since only extent_io self tests utilize this, this patch is required for
      all later cleanup/refactoring.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b1d51f67
    • Tom Rix's avatar
      btrfs: sysfs: remove unneeded semicolon · 445d8ab5
      Tom Rix authored
      A semicolon is not needed after a switch statement.
      Signed-off-by: default avatarTom Rix <trix@redhat.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      445d8ab5
    • Nikolay Borisov's avatar
      btrfs: simplify return values in setup_nodes_for_search · 95b982de
      Nikolay Borisov authored
      The function is needlessly convoluted. Fix that by:
      
      * removing redundant sret variable definition in both if arms
      
      * replace the again/done labels with direct return statements, the
        function is short enough and doesn't do anything special upon exit
      
      * remove BUG_ON on split_node returning a positive number - it can't
        happen as split_node returns either 0 or a negative error code.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      95b982de
    • Nikolay Borisov's avatar
      btrfs: remove useless return value statement in split_node · d5286a92
      Nikolay Borisov authored
      At the point when we set 'ret = 0' it's guaranteed that the function is
      going to return 0 so directly return 0. No functional changes.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d5286a92
    • Filipe Manana's avatar
      btrfs: remove unnecessary attempt to drop extent maps after adding inline extent · f30bed83
      Filipe Manana authored
      At inode.c:cow_file_range_inline(), after we insert the inline extent
      in the fs/subvolume btree, we call btrfs_drop_extent_cache() to drop
      all extent maps in the file range, however that is not necessary because
      we have already done it in the call to btrfs_drop_extents(), which calls
      btrfs_drop_extent_cache() for us, and since at this point we have the file
      range locked in the inode's iotree (we are in the writeback path), we know
      no other task can come in and read stale file extent items or find none
      and therefore create either stale extent maps or an extent map that
      represents a hole.
      
      So just remove that unnecessary call to btrfs_drop_extent_cache(), as it's
      doing nothing and only wasting time. This call has been around since 2008,
      introduced in commit c8b97818 ("Btrfs: Add zlib compression support"),
      but even back then it seems it was not necessary, since we had the range
      locked in the inode's iotree and the call to btrfs_drop_extents() already
      used to always call btrfs_drop_extent_cache().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f30bed83
    • Filipe Manana's avatar
      btrfs: stop incrementing log batch when joining log transaction · bc5b5b1e
      Filipe Manana authored
      When joining a log transaction we acquire the root's log mutex, then
      increment the root's log batch and log writers counters while holding
      the mutex. However we don't need to increment the log batch there,
      because we are holding the mutex and incremented the log writers counter
      as well, so any other task trying to sync log will wait for the current
      task to finish its logging and still achieve the desired log batching.
      
      Since the log batch counter is an atomic counter and is incremented twice
      at the very beginning of the fsync callback (btrfs_sync_file()), once
      before flushing delalloc and once again after waiting for writeback to
      complete, eliminating its increment when joining the log transaction
      may provide some performance gains in case we have multiple concurrent
      tasks doing fsyncs against different files in the same subvolume, as it
      reduces contention on the atomic (locking the cacheline and bouncing it).
      
      When testing fio with 32 jobs, on a 8 cores VM, doing fsyncs against
      different files of the same subvolume, on top of a zram device, I could
      consistently see gains (higher throughput) between 1% to 2%, which is a
      very low value and possibly hard to be observed with a real device (I
      couldn't observe consistent gains with my low/mid end NVMe device).
      So this change is mostly motivated to just simplify the logic, as updating
      the log batch counter is only relevant when an fsync starts and while not
      holding the root's log mutex.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bc5b5b1e
    • Filipe Manana's avatar
      btrfs: skip unnecessary searches for xattrs when logging an inode · f2f121ab
      Filipe Manana authored
      Every time we log an inode we lookup in the fs/subvol tree for xattrs and
      if we have any, log them into the log tree. However it is very common to
      have inodes without any xattrs, so doing the search wastes times, but more
      importantly it adds contention on the fs/subvol tree locks, either making
      the logging code block and wait for tree locks or making the logging code
      making other concurrent operations block and wait.
      
      The most typical use cases where xattrs are used are when capabilities or
      ACLs are defined for an inode, or when SELinux is enabled.
      
      This change makes the logging code detect when an inode does not have
      xattrs and skip the xattrs search the next time the inode is logged,
      unless the inode is evicted and loaded again or a xattr is added to the
      inode. Therefore skipping the search for xattrs on inodes that don't ever
      have xattrs and are fsynced with some frequency.
      
      The following script that calls dbench was used to measure the impact of
      this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
      directly (no intermediary filesystem on the host) and using a non-debug
      kernel (default configuration on Debian distributions):
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
      
        mkfs.btrfs -f -m single -d single $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 200 40
      
        umount $MNT
      
      The results before this change:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5761605     0.172   312.057
       Close        4232452     0.002    10.927
       Rename        243937     1.406   277.344
       Unlink       1163456     0.631   298.402
       Deltree          160    11.581   221.107
       Mkdir             80     0.003     0.005
       Qpathinfo    5221410     0.065   122.309
       Qfileinfo     915432     0.001     3.333
       Qfsinfo       957555     0.003     3.992
       Sfileinfo     469244     0.023    20.494
       Find         2018865     0.448   123.659
       WriteX       2874851     0.049   118.529
       ReadX        9030579     0.004    21.654
       LockX          18754     0.003     4.423
       UnlockX        18754     0.002     0.331
       Flush         403792    10.944   359.494
      
      Throughput 908.444 MB/sec  40 clients  40 procs  max_latency=359.500 ms
      
      The results after this change:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    6442521     0.159   230.693
       Close        4732357     0.002    10.972
       Rename        272809     1.293   227.398
       Unlink       1301059     0.563   218.500
       Deltree          160     7.796    54.887
       Mkdir             80     0.008     0.478
       Qpathinfo    5839452     0.047   124.330
       Qfileinfo    1023199     0.001     4.996
       Qfsinfo      1070760     0.003     5.709
       Sfileinfo     524790     0.033    21.765
       Find         2257658     0.314   125.611
       WriteX       3211520     0.040   232.135
       ReadX        10098969     0.004    25.340
       LockX          20974     0.003     1.569
       UnlockX        20974     0.002     3.475
       Flush         451553    10.287   331.037
      
      Throughput 1011.77 MB/sec  40 clients  40 procs  max_latency=331.045 ms
      
      +10.8% throughput, -8.2% max latency
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f2f121ab
    • Nikolay Borisov's avatar
      btrfs: merge __set_extent_bit and set_extent_bit · 1cab5e72
      Nikolay Borisov authored
      There are only 2 direct calls to set_extent_bit outside of extent-io -
      in btrfs_find_new_delalloc_bytes and btrfs_truncate_block, the rest are
      thin wrappers around __set_extent_bit. This adds unnecessary indirection
      and just makes it more annoying when looking at the various extent bit
      manipulation functions.  This patch renames __set_extent_bit to
      set_extent_bit effectively removing a level of indirection. No
      functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ reformat and remove __must_check ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1cab5e72
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Josef Bacik's avatar
      btrfs: remove extent_buffer::recursed · a55463c9
      Josef Bacik authored
      It is unused everywhere now, it can be removed.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a55463c9
    • Josef Bacik's avatar
      btrfs: remove the recurse parameter from __btrfs_tree_read_lock · 0ecae6ff
      Josef Bacik authored
      It is completely unused now, remove it.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ecae6ff
    • Josef Bacik's avatar
      btrfs: use btrfs_tree_read_lock in btrfs_search_slot · fe596ca3
      Josef Bacik authored
      We no longer use recursion, so
      __btrfs_tree_read_lock(BTRFS_NESTING_NORMAL) == btrfs_tree_read_lock.
      Replace this call with the simple helper.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fe596ca3
    • Josef Bacik's avatar
      btrfs: merge back btrfs_read_lock_root_node helpers · 1bb96598
      Josef Bacik authored
      We no longer have recursive locking and there's no need for separate
      helpers that allowed the transition to rwsem with minimal code changes.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1bb96598
    • Josef Bacik's avatar
      btrfs: locking: remove the recursion handling code · 4048daed
      Josef Bacik authored
      Now that we're no longer using recursion, rip out all of the supporting
      code.  Follow up patches will clean up the callers of these functions.
      
      The extent_buffer::lock_owner is still retained as it allows safety
      checks in btrfs_init_new_buffer for the case that the free space cache
      is corrupted and we try to allocate a block that we are currently using
      and have locked in the path.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4048daed
    • Josef Bacik's avatar
      btrfs: remove btrfs_path::recurse · 2f5239dc
      Josef Bacik authored
      With my async free space cache loading patches ("btrfs: load free space
      cache asynchronously") we no longer have a user of path->recurse and can
      remove it.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2f5239dc
    • Josef Bacik's avatar
      btrfs: unlock to current level in btrfs_next_old_leaf · 0e46318d
      Josef Bacik authored
      Filipe reported the following lockdep splat
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.10.0-rc2-btrfs-next-71 #1 Not tainted
        ------------------------------------------------------
        find/324157 is trying to acquire lock:
        ffff8ebc48d293a0 (btrfs-tree-01#2/3){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
      
        but task is already holding lock:
        ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (btrfs-tree-00){++++}-{3:3}:
      	 lock_acquire+0xd8/0x490
      	 down_write_nested+0x44/0x120
      	 __btrfs_tree_lock+0x27/0x120 [btrfs]
      	 btrfs_search_slot+0x2a3/0xc50 [btrfs]
      	 btrfs_insert_empty_items+0x58/0xa0 [btrfs]
      	 insert_with_overflow+0x44/0x110 [btrfs]
      	 btrfs_insert_xattr_item+0xb8/0x1d0 [btrfs]
      	 btrfs_setxattr+0xd6/0x4c0 [btrfs]
      	 btrfs_setxattr_trans+0x68/0x100 [btrfs]
      	 __vfs_setxattr+0x66/0x80
      	 __vfs_setxattr_noperm+0x70/0x200
      	 vfs_setxattr+0x6b/0x120
      	 setxattr+0x125/0x240
      	 path_setxattr+0xba/0xd0
      	 __x64_sys_setxattr+0x27/0x30
      	 do_syscall_64+0x33/0x80
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (btrfs-tree-01#2/3){++++}-{3:3}:
      	 check_prev_add+0x91/0xc60
      	 __lock_acquire+0x1689/0x3130
      	 lock_acquire+0xd8/0x490
      	 down_read_nested+0x45/0x220
      	 __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
      	 btrfs_next_old_leaf+0x27d/0x580 [btrfs]
      	 btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
      	 iterate_dir+0x170/0x1c0
      	 __x64_sys_getdents64+0x83/0x140
      	 do_syscall_64+0x33/0x80
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-tree-00);
      				 lock(btrfs-tree-01#2/3);
      				 lock(btrfs-tree-00);
          lock(btrfs-tree-01#2/3);
      
         *** DEADLOCK ***
      
        5 locks held by find/324157:
         #0: ffff8ebc502c6e00 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
         #1: ffff8eb97f689980 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: iterate_dir+0x52/0x1c0
         #2: ffff8ebaec00ca58 (btrfs-tree-02#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         #3: ffff8eb98f986f78 (btrfs-tree-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         #4: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
      
        stack backtrace:
        CPU: 2 PID: 324157 Comm: find Not tainted 5.10.0-rc2-btrfs-next-71 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         check_noncircular+0xff/0x110
         ? mark_lock.part.0+0x468/0xe90
         check_prev_add+0x91/0xc60
         __lock_acquire+0x1689/0x3130
         ? kvm_clock_read+0x14/0x30
         ? kvm_sched_clock_read+0x5/0x10
         lock_acquire+0xd8/0x490
         ? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         down_read_nested+0x45/0x220
         ? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
         btrfs_next_old_leaf+0x27d/0x580 [btrfs]
         btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
         iterate_dir+0x170/0x1c0
         __x64_sys_getdents64+0x83/0x140
         ? filldir+0x1d0/0x1d0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens because btrfs_next_old_leaf searches down to our current
      key, and then walks up the path until we can move to the next slot, and
      then reads back down the path so we get the next leaf.
      
      However it doesn't unlock any lower levels until it replaces them with
      the new extent buffer.  This is technically fine, but of course causes
      lockdep to complain, because we could be holding locks on lower levels
      while locking upper levels.
      
      Fix this by dropping all nodes below the level that we use as our new
      starting point before we start reading back down the path.  This also
      allows us to drop the nested/recursive locking magic, because we're no
      longer locking two nodes at the same level anymore.
      Reported-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0e46318d
    • Josef Bacik's avatar
      btrfs: cleanup the locking in btrfs_next_old_leaf · ffeb03cf
      Josef Bacik authored
      We are carrying around this next_rw_lock from when we would do spinning
      vs blocking read locks.  Now that we have the rwsem locking we can
      simply use the read lock flag unconditionally and the read lock helpers.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffeb03cf
    • Anand Jain's avatar
      btrfs: remove unused argument seed from btrfs_find_device · b2598edf
      Anand Jain authored
      Commit 343694eee8d8 ("btrfs: switch seed device to list api"), missed to
      check if the parameter seed is true in the function btrfs_find_device().
      This tells it whether to traverse the seed device list or not.
      
      After this commit, the argument is unused and can be removed.
      
      In device_list_add() it's not necessary because fs_devices always points
      to the device's fs_devices. So with the devid+uuid matching, it will
      find the right device and return, thus not needing to traverse seed
      devices.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b2598edf
    • Anand Jain's avatar
      btrfs: drop never met disk total bytes check in verify_one_dev_extent · 3a160a93
      Anand Jain authored
      Drop the condition in verify_one_dev_extent,
      btrfs_device::disk_total_bytes is set even for a seed device. The
      comment is wrong, the size is properly set when cloning the device.
      
      Commit 1b3922a8 ("btrfs: Use real device structure to verify
      dev extent") introduced it but it's unclear why the total_disk_bytes
      was 0.
      
      Theoretically, all devices (including missing and seed) marked with the
      BTRFS_DEV_STATE_IN_FS_METADATA flag gets the total_disk_bytes updated at
      fill_device_from_item():
      
        open_ctree()
          btrfs_read_chunk_tree()
            read_one_dev()
              open_seed_device()
              fill_device_from_item()
      
      Even if verify_one_dev_extent() reports total_disk_bytes == 0, then its
      a bug to be fixed somewhere else and not in verify_one_dev_extent() as
      it's just a messenger. It is never expected that a total_disk_bytes
      shall be zero.
      
      The function fill_device_from_item() does the job of reading it from the
      item and updating btrfs_device::disk_total_bytes. So both the missing
      device and the seed devices do have their disk_total_bytes updated.
      btrfs_find_device can also return a device from fs_info->seed_list
      because it searches it as well.
      
      Furthermore, while removing the device if there is a power loss, we
      could have a device with its total_bytes = 0, that's still valid.
      
      Instead, introduce a check against maximum block device size in
      read_one_dev().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3a160a93
    • Anand Jain's avatar
      btrfs: drop unused argument step from btrfs_free_extra_devids · bacce86a
      Anand Jain authored
      Commit cf89af14 ("btrfs: dev-replace: fail mount if we don't have
      replace item with target device") dropped the multi stage operation of
      btrfs_free_extra_devids() that does not need to check replace target
      anymore and we can remove the 'step' argument.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bacce86a
    • Filipe Manana's avatar
      btrfs: update the number of bytes used by an inode atomically · 2766ff61
      Filipe Manana authored
      There are several occasions where we do not update the inode's number of
      used bytes atomically, resulting in a concurrent stat(2) syscall to report
      a value of used blocks that does not correspond to a valid value, that is,
      a value that does not match neither what we had before the operation nor
      what we get after the operation completes.
      
      In extreme cases it can result in stat(2) reporting zero used blocks, which
      can cause problems for some userspace tools where they can consider a file
      with a non-zero size and zero used blocks as completely sparse and skip
      reading data, as reported/discussed a long time ago in some threads like
      the following:
      
        https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
      
      The cases where this can happen are the following:
      
      -> Case 1
      
      If we do a write (buffered or direct IO) against a file region for which
      there is already an allocated extent (or multiple extents), then we have a
      short time window where we can report a number of used blocks to stat(2)
      that does not take into account the file region being overwritten. This
      short time window happens when completing the ordered extent(s).
      
      This happens because when we drop the extents in the write range we
      decrement the inode's number of bytes and later on when we insert the new
      extent(s) we increment the number of bytes in the inode, resulting in a
      short time window where a stat(2) syscall can get an incorrect number of
      used blocks.
      
      If we do writes that overwrite an entire file, then we have a short time
      window where we report 0 used blocks to stat(2).
      
      Example reproducer:
      
        $ cat reproducer-1.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        stat_loop()
        {
            trap "wait; exit" SIGTERM
            local filepath=$1
            local expected=$2
            local got
      
            while :; do
                got=$(stat -c %b $filepath)
                if [ $got -ne $expected ]; then
                   echo -n "ERROR: unexpected used blocks"
                   echo " (got: $got expected: $expected)"
                fi
            done
        }
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        # mkfs.reiserfs -f $DEV > /dev/null
        mount $DEV $MNT
      
        xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
        expected=$(stat -c %b $MNT/foobar)
      
        # Create a process to keep calling stat(2) on the file and see if the
        # reported number of blocks used (disk space used) changes, it should
        # not because we are not increasing the file size nor punching holes.
        stat_loop $MNT/foobar $expected &
        loop_pid=$!
      
        for ((i = 0; i < 50000; i++)); do
            xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
        done
      
        kill $loop_pid &> /dev/null
        wait
      
        umount $DEV
      
        $ ./reproducer-1.sh
        ERROR: unexpected used blocks (got: 0 expected: 128)
        ERROR: unexpected used blocks (got: 0 expected: 128)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      -> Case 2
      
      If we do a buffered write against a file region that does not have any
      allocated extents, like a hole or beyond EOF, then during ordered extent
      completion we have a short time window where a concurrent stat(2) syscall
      can report a number of used blocks that does not correspond to the value
      before or after the write operation, a value that is actually larger than
      the value after the write completes.
      
      This happens because once we start a buffered write into an unallocated
      file range we increment the inode's 'new_delalloc_bytes', to make sure
      any stat(2) call gets a correct used blocks value before delalloc is
      flushed and completes. However at ordered extent completion, after we
      inserted the new extent, we increment the inode's number of bytes used
      with the size of the new extent, and only later, when clearing the range
      in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
      counter with the size of the extent. So this results in a short time
      window where a concurrent stat(2) syscall can report a number of used
      blocks that accounts for the new extent twice.
      
      Example reproducer:
      
        $ cat reproducer-2.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        stat_loop()
        {
            trap "wait; exit" SIGTERM
            local filepath=$1
            local expected=$2
            local got
      
            while :; do
                got=$(stat -c %b $filepath)
                if [ $got -ne $expected ]; then
                    echo -n "ERROR: unexpected used blocks"
                    echo " (got: $got expected: $expected)"
                fi
            done
        }
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        # mkfs.reiserfs -f $DEV > /dev/null
        mount $DEV $MNT
      
        touch $MNT/foobar
        write_size=$((64 * 1024))
        for ((i = 0; i < 16384; i++)); do
           offset=$(($i * $write_size))
           xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
           blocks_used=$(stat -c %b $MNT/foobar)
      
           # Fsync the file to trigger writeback and keep calling stat(2) on it
           # to see if the number of blocks used changes.
           stat_loop $MNT/foobar $blocks_used &
           loop_pid=$!
           xfs_io -c "fsync" $MNT/foobar
      
           kill $loop_pid &> /dev/null
           wait $loop_pid
        done
      
        umount $DEV
      
        $ ./reproducer-2.sh
        ERROR: unexpected used blocks (got: 265472 expected: 265344)
        ERROR: unexpected used blocks (got: 284032 expected: 283904)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      -> Case 3
      
      Another case where such problems happen is during other operations that
      replace extents in a file range with other extents. Those operations are
      extent cloning, deduplication and fallocate's zero range operation.
      
      The cause of the problem is similar to the first case. When we drop the
      extents from a range, we decrement the inode's number of bytes, and later
      on, after inserting the new extents we increment it. Since this is not
      done atomically, a concurrent stat(2) call can see and return a number of
      used blocks that is smaller than it should be, does not match the number
      of used blocks before or after the clone/deduplication/zero operation.
      
      Like for the first case, when doing a clone, deduplication or zero range
      operation against an entire file, we end up having a time window where we
      can report 0 used blocks to a stat(2) call.
      
      Example reproducer:
      
        $ cat reproducer-3.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f -m reflink=1 $DEV > /dev/null
        mount $DEV $MNT
      
        extent_size=$((64 * 1024))
        num_extents=16384
        file_size=$(($extent_size * $num_extents))
      
        # File foo has many small extents.
        xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
            > /dev/null
        # File bar has much less extents and has exactly the same data as foo.
        xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
      
        expected=$(stat -c %b $MNT/foo)
      
        # Now deduplicate bar into foo. While the deduplication is in progres,
        # the number of used blocks/file size reported by stat should not change
        xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null  &
        dedupe_pid=$!
        while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
            used=$(stat -c %b $MNT/foo)
            if [ $used -ne $expected ]; then
                echo "Unexpected blocks used: $used (expected: $expected)"
            fi
        done
      
        umount $DEV
      
        $ ./reproducer-3.sh
        Unexpected blocks used: 2076800 (expected: 2097152)
        Unexpected blocks used: 2097024 (expected: 2097152)
        Unexpected blocks used: 2079872 (expected: 2097152)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      So fix this by:
      
      1) Making btrfs_drop_extents() not decrement the VFS inode's number of
         bytes, and instead return the number of bytes;
      
      2) Making any code that drops extents and adds new extents update the
         inode's number of bytes atomically, while holding the btrfs inode's
         spinlock, which is also used by the stat(2) callback to get the inode's
         number of bytes;
      
      3) For ranges in the inode's iotree that are marked as 'delalloc new',
         corresponding to previously unallocated ranges, increment the inode's
         number of bytes when clearing the 'delalloc new' bit from the range,
         in the same critical section that decrements the inode's
         'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
      
      An alternative would be to have btrfs_getattr() wait for any IO (ordered
      extents in progress) and locking the whole range (0 to (u64)-1) while it
      it computes the number of blocks used. But that would mean blocking
      stat(2), which is a very used syscall and expected to be fast, waiting
      for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2766ff61