1. 27 Jul, 2020 40 commits
    • David Sterba's avatar
      btrfs: prefetch chunk tree leaves at mount · d85327b1
      David Sterba authored
      The whole chunk tree is read at mount time so we can utilize readahead
      to get the tree blocks to memory before we read the items. The idea is
      from Robbie, but instead of updating search slot readahead, this patch
      implements the chunk tree readahead manually from nodes on level 1.
      
      We've decided to do specific readahead optimizations and then unify them
      under a common API so we don't break everything by changing the search
      slot readahead logic.
      
      Higher chunk trees grow on large filesystems (many terabytes), and
      prefetching just level 1 seems to be sufficient. Provided example was
      from a 200TiB filesystem with chunk tree level 2.
      
      CC: Robbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d85327b1
    • Johannes Thumshirn's avatar
      btrfs: add metadata_uuid to FS_INFO ioctl · 49bac897
      Johannes Thumshirn authored
      Add retrieval of the filesystem's metadata UUID to the fsinfo ioctl.
      This is driven by setting the BTRFS_FS_INFO_FLAG_METADATA_UUID flag in
      btrfs_ioctl_fs_info_args::flags.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      49bac897
    • Johannes Thumshirn's avatar
      btrfs: add filesystem generation to FS_INFO ioctl · 0fb408a5
      Johannes Thumshirn authored
      Add retrieval of the filesystem's generation to the fsinfo ioctl. This is
      driven by setting the BTRFS_FS_INFO_FLAG_GENERATION flag in
      btrfs_ioctl_fs_info_args::flags.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0fb408a5
    • Johannes Thumshirn's avatar
      btrfs: pass checksum type via BTRFS_IOC_FS_INFO ioctl · 137c5418
      Johannes Thumshirn authored
      With the recent addition of filesystem checksum types other than CRC32c,
      it is not anymore hard-coded which checksum type a btrfs filesystem uses.
      
      Up to now there is no good way to read the filesystem checksum, apart from
      reading the filesystem UUID and then query sysfs for the checksum type.
      
      Add a new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl
      command which usually is used to query filesystem features. Also add a
      flags member indicating that the kernel responded with a set csum_type and
      csum_size field.
      
      For compatibility reasons, only return the csum_type and csum_size if
      the BTRFS_FS_INFO_FLAG_CSUM_INFO flag was passed to the kernel. Also
      clear any unknown flags so we don't pass false positives to user-space
      newer than the kernel.
      
      To simplify further additions to the ioctl, also switch the padding to a
      u8 array. Pahole was used to verify the result of this switch:
      
      The csum members are added before flags, which might look odd, but this
      is to keep the alignment requirements and not to introduce holes in the
      structure.
      
        $ pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko
        struct btrfs_ioctl_fs_info_args {
      	  __u64                      max_id;               /*     0     8 */
      	  __u64                      num_devices;          /*     8     8 */
      	  __u8                       fsid[16];             /*    16    16 */
      	  __u32                      nodesize;             /*    32     4 */
      	  __u32                      sectorsize;           /*    36     4 */
      	  __u32                      clone_alignment;      /*    40     4 */
      	  __u16                      csum_type;            /*    44     2 */
      	  __u16                      csum_size;            /*    46     2 */
      	  __u64                      flags;                /*    48     8 */
      	  __u8                       reserved[968];        /*    56   968 */
      
      	  /* size: 1024, cachelines: 16, members: 10 */
        };
      
      Fixes: 3951e7f0 ("btrfs: add xxhash64 to checksumming algorithms")
      Fixes: 3831bf00 ("btrfs: add sha256 to checksumming algorithm")
      CC: stable@vger.kernel.org # 5.5+
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      137c5418
    • Qu Wenruo's avatar
      btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT · adca4d94
      Qu Wenruo authored
      commit a514d638 ("btrfs: qgroup: Commit transaction in advance to
      reduce early EDQUOT") tries to reduce the early EDQUOT problems by
      checking the qgroup free against threshold and tries to wake up commit
      kthread to free some space.
      
      The problem of that mechanism is, it can only free qgroup per-trans
      metadata space, can't do anything to data, nor prealloc qgroup space.
      
      Now since we have the ability to flush qgroup space, and implemented
      retry-after-EDQUOT behavior, such mechanism can be completely replaced.
      
      So this patch will cleanup such mechanism in favor of
      retry-after-EDQUOT.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      adca4d94
    • Qu Wenruo's avatar
      btrfs: qgroup: try to flush qgroup space when we get -EDQUOT · c53e9653
      Qu Wenruo authored
      [PROBLEM]
      There are known problem related to how btrfs handles qgroup reserved
      space.  One of the most obvious case is the the test case btrfs/153,
      which do fallocate, then write into the preallocated range.
      
        btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
            --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
            +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
            @@ -1,2 +1,5 @@
             QA output created by 153
            +pwrite: Disk quota exceeded
            +/mnt/scratch/testfile2: Disk quota exceeded
            +/mnt/scratch/testfile2: Disk quota exceeded
             Silence is golden
            ...
            (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)
      
      [CAUSE]
      Since commit c6887cd1 ("Btrfs: don't do nocow check unless we have to"),
      we always reserve space no matter if it's COW or not.
      
      Such behavior change is mostly for performance, and reverting it is not
      a good idea anyway.
      
      For preallcoated extent, we reserve qgroup data space for it already,
      and since we also reserve data space for qgroup at buffered write time,
      it needs twice the space for us to write into preallocated space.
      
      This leads to the -EDQUOT in buffered write routine.
      
      And we can't follow the same solution, unlike data/meta space check,
      qgroup reserved space is shared between data/metadata.
      The EDQUOT can happen at the metadata reservation, so doing NODATACOW
      check after qgroup reservation failure is not a solution.
      
      [FIX]
      To solve the problem, we don't return -EDQUOT directly, but every time
      we got a -EDQUOT, we try to flush qgroup space:
      
      - Flush all inodes of the root
        NODATACOW writes will free the qgroup reserved at run_dealloc_range().
        However we don't have the infrastructure to only flush NODATACOW
        inodes, here we flush all inodes anyway.
      
      - Wait for ordered extents
        This would convert the preallocated metadata space into per-trans
        metadata, which can be freed in later transaction commit.
      
      - Commit transaction
        This will free all per-trans metadata space.
      
      Also we don't want to trigger flush multiple times, so here we introduce
      a per-root wait list and a new root status, to ensure only one thread
      starts the flushing.
      
      Fixes: c6887cd1 ("Btrfs: don't do nocow check unless we have to")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c53e9653
    • Qu Wenruo's avatar
      btrfs: qgroup: allow to unreserve range without releasing other ranges · 263da812
      Qu Wenruo authored
      [PROBLEM]
      Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
      reserved space of the changeset.
      
      For example:
      	ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
      	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
      	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);
      
      If the last btrfs_qgroup_reserve_data() failed, it will release the
      entire [0, 3M) range.
      
      This behavior is kind of OK for now, as when we hit -EDQUOT, we normally
      go error handling and need to release all reserved ranges anyway.
      
      But this also means the following call is not possible:
      
      	ret = btrfs_qgroup_reserve_data();
      	if (ret == -EDQUOT) {
      		/* Do something to free some qgroup space */
      		ret = btrfs_qgroup_reserve_data();
      	}
      
      As if the first btrfs_qgroup_reserve_data() fails, it will free all
      reserved qgroup space.
      
      [CAUSE]
      This is because we release all reserved ranges when
      btrfs_qgroup_reserve_data() fails.
      
      [FIX]
      This patch will implement a new function, qgroup_unreserve_range(), to
      iterate through the ulist nodes, to find any nodes in the failure range,
      and remove the EXTENT_QGROUP_RESERVED bits from the io_tree, and
      decrease the extent_changeset::bytes_changed, so that we can revert to
      previous state.
      
      This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT
      happens.
      Suggested-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      263da812
    • Josef Bacik's avatar
      btrfs: convert block group refcount to refcount_t · 48aaeebe
      Josef Bacik authored
      We have refcount_t now with the associated library to handle refcounts,
      which gives us extra debugging around reference count mistakes that may
      be made.  For example it'll warn on any transition from 0->1 or 0->-1,
      which is handy for noticing cases where we've messed up reference
      counting.  Convert the block group ref counting from an atomic_t to
      refcount_t and use the appropriate helpers.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      48aaeebe
    • Marcos Paulo de Souza's avatar
      btrfs: add multi-statement protection to btrfs_set/clear_and_info macros · 60f8667b
      Marcos Paulo de Souza authored
      Multi-statement macros should be enclosed in do/while(0) block to make
      their use safe in single statement if conditions. All current uses of
      the macros are safe, so this change is for future protection.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      60f8667b
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
      btrfs: raid56: use in_range where applicable · 83025863
      Nikolay Borisov authored
      While at it use the opportunity to simplify find_logical_bio_stripe by
      reducing the scope of 'stripe_start' variable and squash the
      sector-to-bytes conversion on one line.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      83025863
    • Nikolay Borisov's avatar
      btrfs: raid56: assign bio in while() when using bio_list_pop · bf28a605
      Nikolay Borisov authored
      Unify the style in the file such that return value of bio_list_pop is
      assigned directly in the while loop. This is in line with the rest of
      the kernel.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bf28a605
    • Nikolay Borisov's avatar
      btrfs: raid56: remove redundant device check in rbio_add_io_page · f90ae76a
      Nikolay Borisov authored
      The merging logic is always executed if the current stripe's device
      is not missing. So there's no point in duplicating the check. Simply
      remove it, while at it reduce the scope of the 'last_end' variable.
      If the current stripe's device is missing we fail the stripe early on.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f90ae76a
    • Nikolay Borisov's avatar
      btrfs: always initialize btrfs_bio::tgtdev_map/raid_map pointers · 608769a4
      Nikolay Borisov authored
      Since btrfs_bio always contains the extra space for the tgtdev_map and
      raid_maps it's pointless to make the assignment iff specific conditions
      are met.
      
      Instead, always assign the pointers to their correct value at allocation
      time. To accommodate this change also move code a bit in
      __btrfs_map_block so that btrfs_bio::stripes array is always initialized
      before the raid_map, subsequently move the call to sort_parity_stripes
      in the 'if' building the raid_map, retaining the old behavior.
      
      To better understand the change, there are 2 aspects to this:
      
      1. The original code is harder to grasp because the calculations for
         initializing raid_map/tgtdev ponters are apart from the initial
         allocation of memory. Having them predicated on 2 separate checks
         doesn't help that either... So by moving the initialisation in
         alloc_btrfs_bio puts everything together.
      
      2. tgtdev/raid_maps are now always initialized despite sometimes they
         might be equal i.e __btrfs_map_block_for_discard calls
         alloc_btrfs_bio with tgtdev = 0 but their usage should be predicated
         on external checks i.e. just because those pointers are non-null
         doesn't mean they are valid per-se. And actually while taking another
         look at __btrfs_map_block I saw a discrepancy:
      
         Original code initialised tgtdev_map if the following check is true:
      
      	   if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
      
         However, further down tgtdev_map is only used if the following check
         is true:
      
      	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op))
      
        e.g. the additional need_full_stripe(op) predicate is there.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ copy more details from mail discussion ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      608769a4
    • Nikolay Borisov's avatar
      btrfs: sysfs: add bdi link to the fsid directory · 3092c68f
      Nikolay Borisov authored
      Since BTRFS uses a private bdi it makes sense to create a link to this
      bdi under /sys/fs/btrfs/<UUID>/bdi. This allows size of read ahead to
      be controlled. Without this patch it's not possible to uniquely identify
      which bdi pertains to which btrfs filesystem in the case of multiple
      btrfs filesystems.
      
      It's fine to simply call sysfs_remove_link without checking if the
      link indeed has been created. The call path
      
      sysfs_remove_link
       kernfs_remove_by_name
        kernfs_remove_by_name_ns
      
      will simply return -ENOENT in case it doesn't exist.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3092c68f
    • Nikolay Borisov's avatar
      btrfs: increment corrupt device counter during compressed read · 5a9472fe
      Nikolay Borisov authored
      If a compressed read fails due to checksum error only a line is printed
      to dmesg, device corrupt counter is not modified.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5a9472fe
    • Nikolay Borisov's avatar
      btrfs: remove needless ASSERT check of orig_bio in end_compressed_bio_read · 26056eab
      Nikolay Borisov authored
      compressed_bio::orig_bio is always set in btrfs_submit_compressed_read
      before any bio submission is performed. Since that function is always
      called with a valid bio it renders the ASSERT unnecessary.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      26056eab
    • Nikolay Borisov's avatar
      btrfs: increment device corruption error in case of checksum error · 814723e0
      Nikolay Borisov authored
      Now that btrfs_io_bio have access to btrfs_device we can safely
      increment the device corruption counter on error. There is one notable
      exception - repair bios for raid. Since those don't go through the
      normal submit_stripe_bio callpath but through raid56_parity_recover thus
      repair bios won't have their device set.
      
      Scrub increments the corruption counter for checksum mismatch as well
      but does not call this function.
      
      Link: https://lore.kernel.org/linux-btrfs/4857863.FCrPRfMyHP@liv/Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      814723e0
    • Nikolay Borisov's avatar
      btrfs: don't check for btrfs_device::bdev in btrfs_end_bio · 3eee86c8
      Nikolay Borisov authored
      btrfs_map_bio ensures that all submitted bios to devices have valid
      btrfs_device::bdev so this check can be removed from btrfs_end_bio. This
      check was added in june 2012 597a60fa ("Btrfs: don't count I/O
      statistic read errors for missing devices") but then in October of the
      same year another commit de1ee92a ("Btrfs: recheck bio against
      block device when we map the bio") started checking for the presence of
      btrfs_device::bdev before actually issuing the bio.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3eee86c8
    • Nikolay Borisov's avatar
      btrfs: record btrfs_device directly in btrfs_io_bio · c31efbdf
      Nikolay Borisov authored
      Instead of recording stripe_index and using that to access correct
      btrfs_device from btrfs_bio::stripes record the btrfs_device in
      btrfs_io_bio. This will enable endio handlers to increment device
      error counters on checksum errors.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c31efbdf
    • Nikolay Borisov's avatar
      btrfs: streamline btrfs_get_io_failure_record logic · 3526302f
      Nikolay Borisov authored
      Make the function directly return a pointer to a failure record and
      adjust callers to handle it. Also refactor the logic inside so that
      the case which allocates the failure record for the first time is not
      handled in an 'if' arm, saving us a level of indentation. Finally make
      the function static as it's not used outside of extent_io.c .
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3526302f
    • Nikolay Borisov's avatar
      btrfs: make get_state_failrec return failrec directly · 2279a270
      Nikolay Borisov authored
      Only failure that get_state_failrec can get is if there is no failure
      for the given address. There is no reason why the function should return
      a status code and use a separate parameter for returning the actual
      failure rec (if one is found). Simplify it by making the return type
      a pointer and return ERR_PTR value in case of errors.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2279a270
    • David Sterba's avatar
      btrfs: remove deprecated mount option subvolrootid · b90a4ab6
      David Sterba authored
      The option subvolrootid used to be a workaround for mounting subvolumes
      and ineffective since 5e2a4b25 ("btrfs: deprecate subvolrootid mount
      option"). We have subvol= that works and we don't need to keep the
      cruft, let's remove it.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b90a4ab6
    • David Sterba's avatar
      btrfs: remove deprecated mount option alloc_start · d801e7a3
      David Sterba authored
      The mount option alloc_start has no effect since 0d0c71b3 ("btrfs:
      obsolete and remove mount option alloc_start") which has details why
      it's been deprecated. We can remove it.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d801e7a3
    • Filipe Manana's avatar
      btrfs: remove no longer needed use of log_writers for the log root tree · a93e0168
      Filipe Manana authored
      When syncing the log, we used to update the log root tree without holding
      neither the log_mutex of the subvolume root nor the log_mutex of log root
      tree.
      
      We used to have two critical sections delimited by the log_mutex of the
      log root tree, so in the first one we incremented the log_writers of the
      log root tree and on the second one we decremented it and waited for the
      log_writers counter to go down to zero. This was because the update of
      the log root tree happened between the two critical sections.
      
      The use of two critical sections allowed a little bit more of parallelism
      and required the use of the log_writers counter, necessary to make sure
      we didn't miss any log root tree update when we have multiple tasks trying
      to sync the log in parallel.
      
      However after commit 06989c79 ("Btrfs: fix race updating log root
      item during fsync") the log root tree update was moved into a critical
      section delimited by the subvolume's log_mutex. Later another commit
      moved the log tree update from that critical section into the second
      critical section delimited by the log_mutex of the log root tree. Both
      commits addressed different bugs.
      
      The end result is that the first critical section delimited by the
      log_mutex of the log root tree became pointless, since there's nothing
      done between it and the second critical section, we just have an unlock
      of the log_mutex followed by a lock operation. This means we can merge
      both critical sections, as the first one does almost nothing now, and we
      can stop using the log_writers counter of the log root tree, which was
      incremented in the first critical section and decremented in the second
      criticial section, used to make sure no one in the second critical section
      started writeback of the log root tree before some other task updated it.
      
      So just remove the mutex_unlock() followed by mutex_lock() of the log root
      tree, as well as the use of the log_writers counter for the log root tree.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/dsk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a93e0168
    • Filipe Manana's avatar
      btrfs: stop incremening log_batch for the log root tree when syncing log · 28a95795
      Filipe Manana authored
      We are incrementing the log_batch atomic counter of the root log tree but
      we never use that counter, it's used only for the log trees of subvolume
      roots. We started doing it when we moved the log_batch and log_write
      counters from the global, per fs, btrfs_fs_info structure, into the
      btrfs_root structure in commit 7237f183 ("Btrfs: fix tree logs
      parallel sync").
      
      So just stop doing it for the log root tree and add a comment over the
      field declaration so inform it's used only for log trees of subvolume
      roots.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/dsk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      28a95795
    • Filipe Manana's avatar
      btrfs: only commit delayed items at fsync if we are logging a directory · 5aa7d1a7
      Filipe Manana authored
      When logging an inode we are committing its delayed items if either the
      inode is a directory or if it is a new inode, created in the current
      transaction.
      
      We need to do it for directories, since new directory indexes are stored
      as delayed items of the inode and when logging a directory we need to be
      able to access all indexes from the fs/subvolume tree in order to figure
      out which index ranges need to be logged.
      
      However for new inodes that are not directories, we do not need to do it
      because the only type of delayed item they can have is the inode item, and
      we are guaranteed to always log an up to date version of the inode item:
      
      *) for a full fsync we do it by committing the delayed inode and then
         copying the item from the fs/subvolume tree with
         copy_inode_items_to_log();
      
      *) for a fast fsync we always log the inode item based on the contents of
         the in-memory struct btrfs_inode. We guarantee this is always done since
         commit e4545de5 ("Btrfs: fix fsync data loss after append write").
      
      So stop running delayed items for a new inodes that are not directories,
      since that forces committing the delayed inode into the fs/subvolume tree,
      wasting time and adding contention to the tree when a full fsync is not
      required. We will only do it in case a fast fsync is needed.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/dsk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5aa7d1a7
    • Filipe Manana's avatar
      btrfs: only commit the delayed inode when doing a full fsync · 8c8648dd
      Filipe Manana authored
      Commit 2c2c452b ("Btrfs: fix fsync when extend references are added
      to an inode") forced a commit of the delayed inode when logging an inode
      in order to ensure we would end up logging the inode item during a full
      fsync. By committing the delayed inode, we updated the inode item in the
      fs/subvolume tree and then later when copying items from leafs modified in
      the current transaction into the log tree (with copy_inode_items_to_log())
      we ended up copying the inode item from the fs/subvolume tree into the log
      tree. Logging an up to date version of the inode item is required to make
      sure at log replay time we get the link count fixup triggered among other
      things (replay xattr deletes, etc). The test case generic/040 from fstests
      exercises the bug which that commit fixed.
      
      However for a fast fsync we don't need to commit the delayed inode because
      we always log an up to date version of the inode item based on the struct
      btrfs_inode we have in-memory. We started doing this for fast fsyncs since
      commit e4545de5 ("Btrfs: fix fsync data loss after append write").
      
      So just stop committing the delayed inode if we are doing a fast fsync,
      we are only wasting time and adding contention on fs/subvolume tree.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/dsk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8c8648dd
    • Qu Wenruo's avatar
      btrfs: preallocate anon block device at first phase of snapshot creation · 2dfb1e43
      Qu Wenruo authored
      [BUG]
      When the anonymous block device pool is exhausted, subvolume/snapshot
      creation fails with EMFILE (Too many files open). This has been reported
      by a user. The allocation happens in the second phase during transaction
      commit where it's only way out is to abort the transaction
      
        BTRFS: Transaction aborted (error -24)
        WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
        RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
        Call Trace:
         create_pending_snapshots+0x82/0xa0 [btrfs]
         btrfs_commit_transaction+0x275/0x8c0 [btrfs]
         btrfs_mksubvol+0x4b9/0x500 [btrfs]
         btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
         btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
         btrfs_ioctl+0x11a4/0x2da0 [btrfs]
         do_vfs_ioctl+0xa9/0x640
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x1a/0x20
         do_syscall_64+0x5a/0x110
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace 33f2f83f3d5250e9 ]---
        BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
        BTRFS info (device sda1): forced readonly
        BTRFS warning (device sda1): Skipping commit of aborted transaction.
        BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
      
      [CAUSE]
      When the global anonymous block device pool is exhausted, the following
      call chain will fail, and lead to transaction abort:
      
       btrfs_ioctl_snap_create_v2()
       |- btrfs_ioctl_snap_create_transid()
          |- btrfs_mksubvol()
             |- btrfs_commit_transaction()
                |- create_pending_snapshot()
                   |- btrfs_get_fs_root()
                      |- btrfs_init_fs_root()
                         |- get_anon_bdev()
      
      [FIX]
      Although we can't enlarge the anonymous block device pool, at least we
      can preallocate anon_dev for subvolume/snapshot in the first phase,
      outside of transaction context and exactly at the moment the user calls
      the creation ioctl.
      Reported-by: default avatarGreed Rong <greedrong@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2dfb1e43
    • Qu Wenruo's avatar
      btrfs: free anon block device right after subvolume deletion · 082b6c97
      Qu Wenruo authored
      [BUG]
      When a lot of subvolumes are created, there is a user report about
      transaction aborted caused by slow anonymous block device reclaim:
      
        BTRFS: Transaction aborted (error -24)
        WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
        RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
        Call Trace:
         create_pending_snapshots+0x82/0xa0 [btrfs]
         btrfs_commit_transaction+0x275/0x8c0 [btrfs]
         btrfs_mksubvol+0x4b9/0x500 [btrfs]
         btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
         btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
         btrfs_ioctl+0x11a4/0x2da0 [btrfs]
         do_vfs_ioctl+0xa9/0x640
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x1a/0x20
         do_syscall_64+0x5a/0x110
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace 33f2f83f3d5250e9 ]---
        BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
        BTRFS info (device sda1): forced readonly
        BTRFS warning (device sda1): Skipping commit of aborted transaction.
        BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
      
      [CAUSE]
      The anonymous device pool is shared and its size is 1M. It's possible to
      hit that limit if the subvolume deletion is not fast enough and the
      subvolumes to be cleaned keep the ids allocated.
      
      [WORKAROUND]
      We can't avoid the anon device pool exhaustion but we can shorten the
      time the id is attached to the subvolume root once the subvolume becomes
      invisible to the user.
      Reported-by: default avatarGreed Rong <greedrong@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      082b6c97
    • Qu Wenruo's avatar
      btrfs: don't allocate anonymous block device for user invisible roots · 851fd730
      Qu Wenruo authored
      [BUG]
      When a lot of subvolumes are created, there is a user report about
      transaction aborted:
      
        BTRFS: Transaction aborted (error -24)
        WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
        RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
        Call Trace:
         create_pending_snapshots+0x82/0xa0 [btrfs]
         btrfs_commit_transaction+0x275/0x8c0 [btrfs]
         btrfs_mksubvol+0x4b9/0x500 [btrfs]
         btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
         btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
         btrfs_ioctl+0x11a4/0x2da0 [btrfs]
         do_vfs_ioctl+0xa9/0x640
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x1a/0x20
         do_syscall_64+0x5a/0x110
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace 33f2f83f3d5250e9 ]---
        BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
        BTRFS info (device sda1): forced readonly
        BTRFS warning (device sda1): Skipping commit of aborted transaction.
        BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
      
      [CAUSE]
      The error is EMFILE (Too many files open) and comes from the anonymous
      block device allocation. The ids are in a shared pool of size 1<<20.
      
      The ids are assigned to live subvolumes, ie. the root structure exists
      in memory (eg. after creation or after the root appears in some path).
      The pool could be exhausted if the numbers are not reclaimed fast
      enough, after subvolume deletion or if other system component uses the
      anon block devices.
      
      [WORKAROUND]
      Since it's not possible to completely solve the problem, we can only
      minimize the time the id is allocated to a subvolume root.
      
      Firstly, we can reduce the use of anon_dev by trees that are not
      subvolume roots, like data reloc tree.
      
      This patch will do extra check on root objectid, to skip roots that
      don't need anon_dev.  Currently it's only data reloc tree and orphan
      roots.
      Reported-by: default avatarGreed Rong <greedrong@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      851fd730
    • Qu Wenruo's avatar
      btrfs: qgroup: export qgroups in sysfs · 49e5fb46
      Qu Wenruo authored
      This patch will add the following sysfs interface:
      
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/referenced
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/exclusive
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_referenced
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_exclusive
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/limit_flags
      
      Which is also available in output of "btrfs qgroup show".
      
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_data
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_pertrans
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_prealloc
      
      The last 3 rsv related members are not visible to users, but can be very
      useful to debug qgroup limit related bugs.
      
      Also, to avoid '/' used in <qgroup_id>, the separator between qgroup
      level and qgroup id is changed to '_'.
      
      The interface is not hidden behind 'debug' as we want this interface to
      be included into production build and to provide another way to read the
      qgroup information besides the ioctls.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      49e5fb46
    • Qu Wenruo's avatar
      btrfs: use __u16 for the return value of btrfs_qgroup_level() · 06f67c47
      Qu Wenruo authored
      The qgroup level is limited to u16, so no need to use u64 for it.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      06f67c47
    • Nikolay Borisov's avatar
      btrfs: make btrfs_qgroup_check_reserved_leak take btrfs_inode · cfdd4592
      Nikolay Borisov authored
      vfs_inode is used only for the inode number everything else requires
      btrfs_inode.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ use btrfs_ino ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cfdd4592
    • Nikolay Borisov's avatar
      btrfs: make btrfs_set_inode_last_trans take btrfs_inode · d9094414
      Nikolay Borisov authored
      Instead of making multiple calls to BTRFS_I simply take btrfs_inode as
      an input paramter.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d9094414
    • Nikolay Borisov's avatar
      btrfs: make prealloc_file_extent_cluster take btrfs_inode · 056d9bec
      Nikolay Borisov authored
      The vfs inode is only used for a pair of inode_lock/unlock calls all
      other uses call for btrfs_inode.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      056d9bec
    • Nikolay Borisov's avatar
      btrfs: remove BTRFS_I calls in btrfs_writepage_fixup_worker · 65d87f79
      Nikolay Borisov authored
      All of its children functions use btrfs_inode.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      65d87f79
    • Nikolay Borisov's avatar
      btrfs: make btrfs_delalloc_reserve_space take btrfs_inode · e5b7231e
      Nikolay Borisov authored
      All of its children take btrfs_inode so bubble up this requirement to
      btrfs_delalloc_reserve_space's interface and stop calling BTRFS_I
      internally.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e5b7231e
    • Nikolay Borisov's avatar
      btrfs: make btrfs_check_data_free_space take btrfs_inode · 36ea6f3e
      Nikolay Borisov authored
      Instead of calling BTRFS_I on the passed vfs_inode take btrfs_inode
      directly.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      36ea6f3e