1. 17 May, 2022 5 commits
    • btrfs: allow defrag to convert inline extents to regular extents · d8101a0c
      Qu Wenruo authored
      Btrfs defaults to max_inline=2K so that small writes are inlined into
      metadata.
      
      The default value has always been a win: even though DUP/RAID1/RAID10
      double the metadata usage, an inlined write should still use less
      physical space than a 4K regular extent.
      
      But since the introduction of RAID1C3 and RAID1C4 that is no longer the
      case: users may find that inlined extents waste too much space, and may
      want to convert those inlined extents back to regular extents.
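      The arithmetic behind this can be sketched as follows. It is a rough
      model only: it assumes the regular extent is stored once (a single-copy
      data profile), ignores item headers and leaf packing, and takes the
      per-profile copy counts from the profile names rather than from the patch.

```python
# Rough space model for a maximally sized inline extent versus one 4K regular
# extent. Assumptions (not from the patch): the regular extent is stored once,
# and item headers are ignored.
MAX_INLINE = 2048       # default max_inline (2K), per the commit message
REGULAR_EXTENT = 4096   # smallest regular extent (4K), per the commit message

# Number of copies each metadata profile keeps (assumed from the profile names).
METADATA_COPIES = {
    "DUP": 2,
    "RAID1": 2,
    "RAID10": 2,
    "RAID1C3": 3,
    "RAID1C4": 4,
}


def inline_physical_bytes(inline_size, profile):
    """Physical bytes used when inline_size bytes live in metadata."""
    return inline_size * METADATA_COPIES[profile]


for profile in METADATA_COPIES:
    used = inline_physical_bytes(MAX_INLINE, profile)
    verdict = "no worse than" if used <= REGULAR_EXTENT else "worse than"
    print(f"{profile:8} inline={used:5} bytes, {verdict} a 4K regular extent")
```

      Under these assumptions the two-copy profiles break even at worst, while
      RAID1C3 and RAID1C4 exceed the 4K of a single regular extent.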
      
      Unfortunately defrag unconditionally skips all inline extents, even
      when the user is trying to convert them back to regular extents.
      
      So this patch will add a small exception for defrag_collect_targets() to
      allow defragging inline extents, if and only if the inlined extents are
      larger than max_inline, allowing users to convert them to regular ones.
      
      This also allows us to defrag extents like the following:
      
      	item 6 key (257 EXTENT_DATA 0) itemoff 15794 itemsize 69
      		generation 7 type 0 (inline)
      		inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15741 itemsize 53
      		generation 7 type 1 (regular)
      		extent data disk byte 13631488 nr 4096
      		extent data offset 0 nr 16384 ram 16384
      		extent compression 1 (zlib)
      
      Previously we were unable to do any defrag, since the first extent is
      inlined, and the second one has no extent to merge with.
      
      Now we can defrag it to just one single extent, saving 48 bytes of
      metadata space.
      
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		generation 8 type 1 (regular)
      		extent data disk byte 13635584 nr 4096
      		extent data offset 0 nr 20480 ram 20480
      		extent compression 1 (zlib)
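      The exception described in this commit can be sketched as a small
      predicate. The function name is hypothetical, and the sketch assumes the
      comparison uses the uncompressed length of the inline extent, which is
      what the dump above suggests (48 bytes on disk, ram_bytes 4096); the
      actual check in defrag_collect_targets() may differ in detail.

```python
MAX_INLINE = 2048  # default max_inline, per the commit message


def defrag_skips_inline_extent(uncompressed_len, max_inline=MAX_INLINE):
    """Hypothetical predicate: skip inline extents that fit within max_inline."""
    return uncompressed_len <= max_inline


# The inline extent from the dump above: 48 bytes on disk, ram_bytes 4096.
print(defrag_skips_inline_extent(4096))
# A small inline extent that still fits within max_inline.
print(defrag_skips_inline_extent(1024))
```

      Only the first extent becomes a defrag target; inline extents within
      max_inline are still skipped as before.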
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add "0x" prefix for unsupported optional features · d5321a0f
      Qu Wenruo authored
      The following error message obviously lacks the "0x" prefix:
      
        cannot mount because of unsupported optional features (4000)
      
      Add the prefix to make it less confusing. This can happen on older
      kernels that try to mount a filesystem with newer features so it makes
      sense to backport to older trees.
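      To see why the missing prefix is confusing, here is a minimal
      illustration in Python rather than the kernel's own printk formatting.
      The helper name is made up for this example.

```python
def unsupported_features_message(features, with_prefix):
    """Format the mount error with or without the "0x" prefix."""
    value = f"{features:#x}" if with_prefix else f"{features:x}"
    return f"cannot mount because of unsupported optional features ({value})"


flags = 0x4000  # the value from the message quoted above
print(unsupported_features_message(flags, with_prefix=False))
print(unsupported_features_message(flags, with_prefix=True))
# Read as decimal, "4000" is a different number than the flag that was set.
print(int("4000", 10) == flags)
```

      Without the prefix a reader cannot tell whether 4000 is decimal or
      hexadecimal, and the two readings name different feature bits.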
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not account twice for inode ref when reserving metadata units · 97bdf1a9
      Filipe Manana authored
      When reserving metadata units for creating an inode, we don't need to
      reserve one extra unit for the inode ref item because when creating the
      inode, at btrfs_create_new_inode(), we always insert the inode item and
      the inode ref item in a single batch (a single btree insert operation,
      and both ending up in the same leaf).
      
      As we have already accounted one unit for the inode item, the extra unit
      for the inode ref item is superfluous: it only makes us reserve more
      metadata than necessary, often adding more reclaim pressure if we are
      low on available metadata space.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer · aa9ffadf
      Naohiro Aota authored
      The block_group->alloc_offset is an offset from the start of the block
      group, whereas ->meta_write_pointer is an address in the logical
      space. So we should compare the alloc_offset shifted by
      block_group->start.
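      The unit mismatch can be illustrated with a small model. The class and
      function names are hypothetical, and the use of a "greater than"
      comparison is an assumption made for the example; only the need to add
      the block group start comes from the commit message.

```python
from dataclasses import dataclass


@dataclass
class BlockGroup:
    start: int               # logical address where the block group begins
    alloc_offset: int        # allocated bytes, relative to start
    meta_write_pointer: int  # logical address of the metadata write pointer


def allocated_past_write_pointer_buggy(bg):
    # Mixes an offset with a logical address.
    return bg.alloc_offset > bg.meta_write_pointer


def allocated_past_write_pointer(bg):
    # Shift the offset by the block group start so both sides are addresses.
    return bg.start + bg.alloc_offset > bg.meta_write_pointer


GIB = 1024 ** 3
KIB = 1024
bg = BlockGroup(start=GIB, alloc_offset=64 * KIB, meta_write_pointer=GIB + 32 * KIB)
print(allocated_past_write_pointer_buggy(bg))
print(allocated_past_write_pointer(bg))
```

      For any block group that does not start at logical address 0, the
      unshifted comparison is effectively constant, so the check never fires.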
      
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: avoid trashing the page cache · 152555b3
      Filipe Manana authored
      A send operation reads the extent data it sends in write commands
      through the buffered IO path, both because it's simple and because it
      makes use of the generic readahead infrastructure, which results in a
      massive speedup.
      
      However this fills the page cache with data that, most of the time, is
      really only used by the send operation - once the write commands are sent,
      it's not useful to have the data in the page cache anymore. For large
      snapshots, bringing all data into the page cache eventually leads to the
      need to evict other data from the page cache that may be more useful for
      applications (and kernel subsystems).
      
      Even if extents are shared with the subvolume on which a snapshot is
      based and the data is currently in the page cache due to being read
      through the subvolume, attempting to read the data through the snapshot
      will always result in bringing a new copy of the data into another
      location in the page cache (there's currently no shared memory for
      shared extents).
      
      So make send evict the data it has read if, when it first opened the
      inode, the inode's mapping had no pages loaded, that is, when
      inode->i_mapping->nr_pages had a value of 0. Do this instead of deciding
      based on the return value of filemap_range_has_page() before reading an
      extent because the generic readahead mechanism may read pages beyond the
      range we request (and it very often does it), which means a call to
      filemap_range_has_page() will return true due to the readahead that was
      triggered when processing a previous extent - we don't have a simple way
      to distinguish this case from the case where the data was brought into
      the page cache through someone else. So checking for the mapping number
      of pages being 0 when we first open the inode is simple, cheap and it
      generally accomplishes the goal of not trashing the page cache. The
      only exception is when part of the data was previously loaded into the
      page cache through the snapshot by some other process: in that case we
      end up not evicting any data that send brings into the page cache, just
      like before this change, but that is not the common case.
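      The decision described above can be modeled in a few lines. This is a
      toy model, not the kernel code: all class and function names are
      hypothetical, the page cache is reduced to a page counter, and eviction
      is modeled as happening when the inode is closed, which is a
      simplification.

```python
class Mapping:
    """Toy stand-in for inode->i_mapping; only tracks a page count."""

    def __init__(self, nr_pages=0):
        self.nr_pages = nr_pages


class SendInodeState:
    """Hypothetical per-inode state kept by send while processing one inode."""

    def __init__(self, mapping):
        self.mapping = mapping
        # Decided once, when send first opens the inode.
        self.clean_page_cache = mapping.nr_pages == 0

    def read_extent(self, pages, readahead_pages=0):
        # Buffered reads (plus any readahead) populate the page cache.
        self.mapping.nr_pages += pages + readahead_pages

    def close(self):
        # Evict only if the cache was empty before send touched the inode.
        if self.clean_page_cache:
            self.mapping.nr_pages = 0


def send_inode(mapping, extents):
    """Read every extent (sizes in pages), then close; return pages left cached."""
    state = SendInodeState(mapping)
    for pages in extents:
        state.read_extent(pages, readahead_pages=8)
    state.close()
    return mapping.nr_pages


print(send_inode(Mapping(0), [16, 16]))  # cache was empty, so send cleans up
print(send_inode(Mapping(4), [16, 16]))  # someone else had pages, nothing evicted
```

      Because the flag is sampled once at open time, pages that readahead
      brings in later cannot be mistaken for pages loaded by another process.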
      
      Example scenario, on a box with 32G of RAM:
      
        $ btrfs subvolume create /mnt/sv1
        $ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
      
        $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
      
        $ free -m
                       total        used        free      shared  buff/cache   available
        Mem:           31937         186       26866           0        4883       31297
        Swap:           8188           0        8188
      
        # After this we have 4G less free memory.
        $ btrfs send /mnt/snap1 >/dev/null
      
        $ free -m
                       total        used        free      shared  buff/cache   available
        Mem:           31937         186       22814           0        8935       31297
        Swap:           8188           0        8188
      
      The same, obviously, applies to an incremental send.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 16 May, 2022 35 commits
    • btrfs: send: keep the current inode open while processing it · 521b6803
      Filipe Manana authored
      Every time we send a write command, we open the inode, read some data to
      a buffer and then close the inode. The amount of data we read for each
      write command is at most 48K, returned by max_send_read_size(), and that
      corresponds to: BTRFS_SEND_BUF_SIZE - 16K = 48K. In practice this does
      not add any significant overhead, because the time elapsed between every
      close (iput()) and open (btrfs_iget()) is very short, so the inode is kept
      in the VFS's cache after the iput() and it's still there by the time we
      do the next btrfs_iget().
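      For a sense of scale, the number of open/close pairs can be estimated
      from the figures above. The 64K buffer size is derived from the
      statement that BTRFS_SEND_BUF_SIZE - 16K = 48K; the estimate assumes
      every write command carries the maximum amount of data and ignores
      extent boundaries, so real counts can only be higher.

```python
KIB = 1024
GIB = 1024 ** 3

SEND_BUF_SIZE = 64 * KIB                      # derived: buffer size - 16K = 48K
MAX_SEND_READ_SIZE = SEND_BUF_SIZE - 16 * KIB  # 48K per write command


def write_commands_needed(file_size):
    """Lower bound on write commands, one inode open/close pair each."""
    return -(-file_size // MAX_SEND_READ_SIZE)  # ceiling division


print(write_commands_needed(4 * GIB))
```

      For a 4G file this gives tens of thousands of iput()/btrfs_iget() pairs
      that keeping the inode open avoids.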
      
      As between processing extents of the current inode we don't do anything
      else, it makes sense to keep the inode open after we process its first
      extent that needs to be sent and keep it open until we start processing
      the next inode. This serves to facilitate the next change, which aims
      to avoid having send operations trash the page cache with data extents.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate the btrfs_dio_private as part of the iomap dio bio · 642c5d34
      Christoph Hellwig authored
      Create a new bio_set that contains all the per-bio private data needed
      by btrfs for direct I/O and tell the iomap code to use that instead
      of separately allocating the btrfs_dio_private structure.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move struct btrfs_dio_private to inode.c · a3e171a0
      Christoph Hellwig authored
      The btrfs_dio_private structure is only used in inode.c, so move the
      definition there.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the disk_bytenr in struct btrfs_dio_private · acb8b52a
      Christoph Hellwig authored
      This field is never used, so remove it. Last use was probably in
      23ea8e5a ("Btrfs: load checksum data once when submitting a direct
      read io").
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate dio_data on stack · 491a6d01
      Christoph Hellwig authored
      Make use of the new iomap_iter->private field to avoid a memory
      allocation per iomap range.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • iomap: add per-iomap_iter private data · 786f847f
      Christoph Hellwig authored
      Allow the file system to keep state for all iterations.  For now only
      wire it up for direct I/O as there is an immediate need for it there.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • iomap: allow the file system to provide a bio_set for direct I/O · 908c5490
      Christoph Hellwig authored
      Allow the file system to provide a specific bio_set for allocating
      direct I/O bios.  This will allow file systems that use the
      ->submit_io hook to stash away additional information for file system
      use.
      
      To make use of this additional space for information in the completion
      path, the file system needs to override the ->bi_end_io callback and
      then call back into iomap, so export iomap_dio_bio_end_io for that.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add a btrfs_dio_rw wrapper · 36e8c622
      Christoph Hellwig authored
      Add a wrapper around iomap_dio_rw that keeps the direct I/O internals
      isolated in inode.c.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: zone finish unused block group · 74e91b12
      Naohiro Aota authored
      While the active zones within an active block group are reset, and their
      active resource is released, the block group itself is kept in the active
      block group list and marked as active. As a result, the list will contain
      more than max_active_zones block groups. That itself is not fatal for the
      device as the zones are properly reset.
      
      However, that inflated list is, of course, strange. Also, an upcoming
      patch series, which deactivates an active block group on demand, gets
      confused by the wrong list.
      
      So, fix the issue by finishing the unused block group once it becomes
      read-only, so that we can release the active resource at an early stage.
      
      Fixes: be1a1d7a ("btrfs: zoned: finish fully written block group")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: properly finish block group on metadata write · 56fbb0a4
      Naohiro Aota authored
      Commit be1a1d7a ("btrfs: zoned: finish fully written block group")
      introduced zone finishing code for both the data and metadata end_io
      paths. However, the metadata side is not working as it should. First, it
      compares a logical address (eb->start + eb->len) with an offset within a
      block group (cache->zone_capacity) in submit_eb_page(). That essentially
      disabled zone finishing on the metadata end_io path.
      
      Furthermore, fixing the issue above revealed that we cannot call
      btrfs_zone_finish_endio() in end_extent_buffer_writeback(), because we
      cannot call btrfs_lookup_block_group(), which requires a spin lock,
      inside end_io context.
      
      Introduce btrfs_schedule_zone_finish_bg() to wait for the extent buffer
      writeback and do the zone finish IO in a workqueue.
      
      Also, drop EXTENT_BUFFER_ZONE_FINISH as it is no longer used.
      
      Fixes: be1a1d7a ("btrfs: zoned: finish fully written block group")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: finish block group when there are no more allocatable bytes left · 8b8a5399
      Naohiro Aota authored
      Currently, btrfs_zone_finish_endio() finishes a block group only when the
      written region reaches the end of the block group. We can also finish the
      block group when no more allocation is possible.
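      The difference between the two conditions can be sketched as below.
      The function names, the 256M zone capacity and the 16K minimum
      allocation size are assumptions chosen for the example; the commit
      message only states that a block group can also be finished once no
      further allocation fits.

```python
def finishes_when_fully_written(bg_start, zone_capacity, written_end):
    """Old rule: finish only when the write reaches the end of the block group."""
    return written_end >= bg_start + zone_capacity


def finishes_when_unallocatable(bg_start, zone_capacity, written_end,
                                min_alloc_bytes):
    """New rule: finish as soon as the remaining space cannot fit an allocation."""
    remaining = bg_start + zone_capacity - written_end
    return remaining < min_alloc_bytes


KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

bg_start = GIB               # hypothetical logical start address
zone_capacity = 256 * MIB    # hypothetical zone capacity
min_alloc = 16 * KIB         # hypothetical minimum allocation size

# The last write ends 4K short of the block group end.
written_end = bg_start + zone_capacity - 4 * KIB
print(finishes_when_fully_written(bg_start, zone_capacity, written_end))
print(finishes_when_unallocatable(bg_start, zone_capacity, written_end, min_alloc))
```

      With a 4K tail that can never hold a 16K allocation, the old rule keeps
      the zone active indefinitely while the new rule releases it.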
      
      Fixes: be1a1d7a ("btrfs: zoned: finish fully written block group")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: consolidate zone finish functions · d70cbdda
      Naohiro Aota authored
      btrfs_zone_finish() and btrfs_zone_finish_endio() have similar code.
      Introduce do_zone_finish() to factor out the common code.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: introduce btrfs_zoned_bg_is_full · 1bfd4767
      Naohiro Aota authored
      Introduce a wrapper to check if all the space in a block group is
      allocated or not.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: improve error reporting in lookup_inline_extent_backref · cf4f03c3
      Nikolay Borisov authored
      When iterating the backrefs in an extent item, if the ptr to the
      'current' backref record goes beyond the extent item, a warning is
      generated and -ENOENT is returned. However, what's more appropriate for
      debugging such cases is to return -EUCLEAN and also print identifying
      information about the performed search as well as the current content of
      the leaf containing the possibly corrupted extent item.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename bio_ctrl::bio_flags to compress_type · 0f07003b
      David Sterba authored
      The bio_ctrl is the last use of bio_flags that has been converted to
      compress type everywhere else.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename bio_flags in parameters and switch type · cb3a12d9
      David Sterba authored
      Several functions take a bio_flags parameter that has been simplified to
      just the compress type. Unify the name and change the type accordingly.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename io_failure_record::bio_flags to compress_type · 0ff40013
      David Sterba authored
      The bio_flags is now used to store unchanged compress type, so unify
      that.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: open code extent_set_compress_type helpers · 7f6ca7f2
      David Sterba authored
      The helpers extent_set_compress_type and extent_compress_type have
      become trivial after previous cleanups and can be removed.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify handling of bio_ctrl::bio_flags · 2a5232a8
      David Sterba authored
      The bio_flags are used only to encode the compression and there are no
      other EXTENT_BIO_* flags, so the compress type can be stored directly.
      The struct member name is left unchanged and will be cleaned in later
      patches.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove trivial helper update_nr_written · 572f3dad
      David Sterba authored
      The helper used to do more with the wbc state but now it's just one
      subtraction, no need to have a special helper.
      
      It became trivial in a9132667 ("Btrfs: make mapping->writeback_index
      point to the last written page").
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove btrfs_delayed_extent_op::is_data · 0e3696f8
      David Sterba authored
      The value of btrfs_delayed_extent_op::is_data is always false, we can
      cascade the change and simplify code that depends on it, removing the
      structure member eventually.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sink parameter is_data to btrfs_set_disk_extent_flags · 2fe6a5a1
      David Sterba authored
      The parameter was added in 2009 in the infamous monster commit
      5d4f98a2 ("Btrfs: Mixed back reference  (FORWARD ROLLING FORMAT
      CHANGE)") but has not been used since. We can sink it and allow further
      simplifications.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix deadlock between concurrent dio writes when low on free data space · f5585f4f
      Filipe Manana authored
      When reserving data space for a direct IO write we can end up deadlocking
      if we have multiple tasks attempting a write to the same file range, there
      are multiple extents covered by that file range, we are low on available
      space for data and the writes don't expand the inode's i_size.
      
      The deadlock can happen like this:
      
      1) We have a file with an i_size of 1M, at offset 0 it has an extent with
         a size of 128K and at offset 128K it has another extent also with a
         size of 128K;
      
      2) Task A does a direct IO write against file range [0, 256K), and because
         the write is within the i_size boundary, it takes the inode's lock (VFS
         level) in shared mode;
      
      3) Task A locks the file range [0, 256K) at btrfs_dio_iomap_begin(), and
         then gets the extent map for the extent covering the range [0, 128K).
         At btrfs_get_blocks_direct_write(), it creates an ordered extent for
         that file range ([0, 128K));
      
      4) Before returning from btrfs_dio_iomap_begin(), it unlocks the file
         range [0, 256K);
      
      5) Task A executes btrfs_dio_iomap_begin() again, this time for the file
         range [128K, 256K), and locks the file range [128K, 256K);
      
      6) Task B starts a direct IO write against file range [0, 256K) as well.
         It also locks the inode in shared mode, as it's within the i_size limit,
         and then tries to lock file range [0, 256K). It is able to lock the
         subrange [0, 128K) but then blocks waiting for the range [128K, 256K),
         as it is currently locked by task A;
      
      7) Task A enters btrfs_get_blocks_direct_write() and tries to reserve data
         space. Because we are low on available free space, it triggers the
         async data reclaim task, and waits for it to reserve data space;
      
      8) The async reclaim task decides to wait for all existing ordered extents
         to complete (through btrfs_wait_ordered_roots()).
         It finds the ordered extent previously created by task A for the file
         range [0, 128K) and waits for it to complete;
      
      9) The ordered extent for the file range [0, 128K) can not complete
         because it blocks at btrfs_finish_ordered_io() when trying to lock the
         file range [0, 128K).
      
         This results in a deadlock, because:
      
         - task B is holding the file range [0, 128K) locked, waiting for the
           range [128K, 256K) to be unlocked by task A;
      
         - task A is holding the file range [128K, 256K) locked and it's waiting
           for the async data reclaim task to satisfy its space reservation
           request;
      
         - the async data reclaim task is waiting for ordered extent [0, 128K)
           to complete, but the ordered extent can not complete because the
           file range [0, 128K) is currently locked by task B, which is waiting
           on task A to unlock file range [128K, 256K), while task A is waiting
           on the async data reclaim task.
      
         This results in a deadlock between 4 tasks: task A, task B, the async
         data reclaim task and the task doing ordered extent completion (a work
         queue task).
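      The circular wait in the steps above can be written down as a wait-for
      graph and checked mechanically. This is only an illustration of the
      dependency chain; the task labels are taken from the description and
      the helper is not part of btrfs.

```python
# Each task maps to the task it is blocked on, per the steps above.
WAITS_ON = {
    "task B": "task A",                               # wants range [128K, 256K)
    "task A": "async data reclaim",                   # wants its space reservation
    "async data reclaim": "ordered extent completion",  # waits for [0, 128K) to finish
    "ordered extent completion": "task B",            # wants range [0, 128K)
}


def find_cycle(waits_on, start):
    """Follow wait-for edges from start; return the cycle found, or None."""
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = waits_on.get(node)
    if node is None:
        return None
    return path[path.index(node):] + [node]


print(" -> ".join(find_cycle(WAITS_ON, "task A")))
```

      Removing any single edge, for example by having task A reserve space
      before it takes the range lock, breaks the cycle.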
      
      This type of deadlock can sporadically be triggered by the test case
      generic/300 from fstests, and results in a stack trace like the following:
      
      [12084.033689] INFO: task kworker/u16:7:123749 blocked for more than 241 seconds.
      [12084.034877]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.035562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.036548] task:kworker/u16:7   state:D stack:    0 pid:123749 ppid:     2 flags:0x00004000
      [12084.036554] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
      [12084.036599] Call Trace:
      [12084.036601]  <TASK>
      [12084.036606]  __schedule+0x3cb/0xed0
      [12084.036616]  schedule+0x4e/0xb0
      [12084.036620]  btrfs_start_ordered_extent+0x109/0x1c0 [btrfs]
      [12084.036651]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [12084.036659]  btrfs_run_ordered_extent_work+0x1a/0x30 [btrfs]
      [12084.036688]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [12084.036719]  ? lock_is_held_type+0xe8/0x140
      [12084.036727]  process_one_work+0x252/0x5a0
      [12084.036736]  ? process_one_work+0x5a0/0x5a0
      [12084.036738]  worker_thread+0x52/0x3b0
      [12084.036743]  ? process_one_work+0x5a0/0x5a0
      [12084.036745]  kthread+0xf2/0x120
      [12084.036747]  ? kthread_complete_and_exit+0x20/0x20
      [12084.036751]  ret_from_fork+0x22/0x30
      [12084.036765]  </TASK>
      [12084.036769] INFO: task kworker/u16:11:153787 blocked for more than 241 seconds.
      [12084.037702]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.038540] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.039506] task:kworker/u16:11  state:D stack:    0 pid:153787 ppid:     2 flags:0x00004000
      [12084.039511] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
      [12084.039551] Call Trace:
      [12084.039553]  <TASK>
      [12084.039557]  __schedule+0x3cb/0xed0
      [12084.039566]  schedule+0x4e/0xb0
      [12084.039569]  schedule_timeout+0xed/0x130
      [12084.039573]  ? mark_held_locks+0x50/0x80
      [12084.039578]  ? _raw_spin_unlock_irq+0x24/0x50
      [12084.039580]  ? lockdep_hardirqs_on+0x7d/0x100
      [12084.039585]  __wait_for_common+0xaf/0x1f0
      [12084.039587]  ? usleep_range_state+0xb0/0xb0
      [12084.039596]  btrfs_wait_ordered_extents+0x3d6/0x470 [btrfs]
      [12084.039636]  btrfs_wait_ordered_roots+0x175/0x240 [btrfs]
      [12084.039670]  flush_space+0x25b/0x630 [btrfs]
      [12084.039712]  btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
      [12084.039747]  process_one_work+0x252/0x5a0
      [12084.039756]  ? process_one_work+0x5a0/0x5a0
      [12084.039758]  worker_thread+0x52/0x3b0
      [12084.039762]  ? process_one_work+0x5a0/0x5a0
      [12084.039765]  kthread+0xf2/0x120
      [12084.039766]  ? kthread_complete_and_exit+0x20/0x20
      [12084.039770]  ret_from_fork+0x22/0x30
      [12084.039783]  </TASK>
      [12084.039800] INFO: task kworker/u16:17:217907 blocked for more than 241 seconds.
      [12084.040709]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.041398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.042404] task:kworker/u16:17  state:D stack:    0 pid:217907 ppid:     2 flags:0x00004000
      [12084.042411] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
      [12084.042461] Call Trace:
      [12084.042463]  <TASK>
      [12084.042471]  __schedule+0x3cb/0xed0
      [12084.042485]  schedule+0x4e/0xb0
      [12084.042490]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
      [12084.042539]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [12084.042551]  lock_extent_bits+0x37/0x90 [btrfs]
      [12084.042601]  btrfs_finish_ordered_io.isra.0+0x3fd/0x960 [btrfs]
      [12084.042656]  ? lock_is_held_type+0xe8/0x140
      [12084.042667]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [12084.042716]  ? lock_is_held_type+0xe8/0x140
      [12084.042727]  process_one_work+0x252/0x5a0
      [12084.042742]  worker_thread+0x52/0x3b0
      [12084.042750]  ? process_one_work+0x5a0/0x5a0
      [12084.042754]  kthread+0xf2/0x120
      [12084.042757]  ? kthread_complete_and_exit+0x20/0x20
      [12084.042763]  ret_from_fork+0x22/0x30
      [12084.042783]  </TASK>
      [12084.042798] INFO: task fio:234517 blocked for more than 241 seconds.
      [12084.043598]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.044282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.045244] task:fio             state:D stack:    0 pid:234517 ppid:234515 flags:0x00004000
      [12084.045248] Call Trace:
      [12084.045250]  <TASK>
      [12084.045254]  __schedule+0x3cb/0xed0
      [12084.045263]  schedule+0x4e/0xb0
      [12084.045266]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
      [12084.045298]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [12084.045306]  lock_extent_bits+0x37/0x90 [btrfs]
      [12084.045336]  btrfs_dio_iomap_begin+0x336/0xc60 [btrfs]
      [12084.045370]  ? lock_is_held_type+0xe8/0x140
      [12084.045378]  iomap_iter+0x184/0x4c0
      [12084.045383]  __iomap_dio_rw+0x2c6/0x8a0
      [12084.045406]  iomap_dio_rw+0xa/0x30
      [12084.045408]  btrfs_do_write_iter+0x370/0x5e0 [btrfs]
      [12084.045440]  aio_write+0xfa/0x2c0
      [12084.045448]  ? __might_fault+0x2a/0x70
      [12084.045451]  ? kvm_sched_clock_read+0x14/0x40
      [12084.045455]  ? lock_release+0x153/0x4a0
      [12084.045463]  io_submit_one+0x615/0x9f0
      [12084.045467]  ? __might_fault+0x2a/0x70
      [12084.045469]  ? kvm_sched_clock_read+0x14/0x40
      [12084.045478]  __x64_sys_io_submit+0x83/0x160
      [12084.045483]  ? syscall_enter_from_user_mode+0x1d/0x50
      [12084.045489]  do_syscall_64+0x3b/0x90
      [12084.045517]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [12084.045521] RIP: 0033:0x7fa76511af79
      [12084.045525] RSP: 002b:00007ffd6d6b9058 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
      [12084.045530] RAX: ffffffffffffffda RBX: 00007fa75ba6e760 RCX: 00007fa76511af79
      [12084.045532] RDX: 0000557b304ff3f0 RSI: 0000000000000001 RDI: 00007fa75ba4c000
      [12084.045535] RBP: 00007fa75ba4c000 R08: 00007fa751b76000 R09: 0000000000000330
      [12084.045537] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
      [12084.045540] R13: 0000000000000000 R14: 0000557b304ff3f0 R15: 0000557b30521eb0
      [12084.045561]  </TASK>
      
      Fix this issue by always reserving data space before locking a file range
      at btrfs_dio_iomap_begin(). If we can't reserve the space, we don't error
      out immediately. Instead, after locking the file range, we check if we can
      do a NOCOW write. If we can, we don't error out, since we don't need to
      allocate a data extent; if we can't, we error out with -ENOSPC. This also
      implies that we may end up reserving space when it's not needed, because
      the write ends up being done in NOCOW mode. In that case we just release
      the space after noticing we did a NOCOW write. This is the same type of
      logic that is done in the path for buffered IO writes.
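The decision described in the fix can be modelled in a few lines. The following is only a sketch of the outcome table, not the btrfs code: `dio_write_decide()` and its parameters are hypothetical names, and the two booleans stand in for the results of the reservation attempt (made before locking) and the NOCOW check (made after locking).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the decision, not the kernel implementation.
 *
 * reserve_ok: data space reservation succeeded (attempted BEFORE locking)
 * can_nocow:  the range can be written in NOCOW mode (checked AFTER locking)
 * *release_reservation: set when space was reserved but is not needed
 *
 * Returns 0 if the write may proceed, -ENOSPC otherwise.
 */
static int dio_write_decide(bool reserve_ok, bool can_nocow,
			    bool *release_reservation)
{
	assert(release_reservation != NULL);
	*release_reservation = false;

	if (can_nocow) {
		/* No new data extent is allocated; give back any reservation. */
		*release_reservation = reserve_ok;
		return 0;
	}

	/* COW write: it needs the space that was reserved up front. */
	return reserve_ok ? 0 : -ENOSPC;
}
```

The only failing combination is a failed reservation together with a range that cannot be written in NOCOW mode.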
      
      Fixes: f0bfa76a ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range")
      CC: stable@vger.kernel.org # 5.17+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Goldwyn Rodrigues's avatar
      btrfs: derive compression type from extent map during reads · 1d8fa2e2
      Goldwyn Rodrigues authored
      Derive the compression type from the extent map instead of from the bio
      flags passed in. This makes it more precise and not reliant on function
      parameters.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Qu Wenruo's avatar
      btrfs: scrub: move scrub_remap_extent() call into scrub_extent() · a13467ee
      Qu Wenruo authored
      [SUSPICIOUS CODE]
      When refactoring scrub code, I noticed a very strange behavior around
      scrub_remap_extent():
      
      	if (sctx->is_dev_replace)
      		scrub_remap_extent(fs_info, cur_logical, scrub_len,
      				   &cur_physical, &target_dev, &cur_mirror);
      
      As the replace target is a 1:1 copy of the source device, the physical
      offset inside the target should be the same as the physical offset inside
      the source, so this remap call makes no sense to me.
      
      [REAL FUNCTIONALITY]
      After more investigation, it turns out that neither the function name
      scrub_remap_extent() nor its if () condition tells the real story.
      
      The real story behind this function is that scrub_pages() never expects
      a missing device, even when replacing a missing device.
      
      What scrub_remap_extent() really does is find a live mirror, so that the
      later scrub_pages() call reads data from the good copy rather than from
      the missing device, which would increase error counters unnecessarily.
      
      [IMPROVEMENT]
      There is no need to call scrub_remap_extent() in scrub_simple_mirror()
      at all; we only need to call it before we call scrub_pages().
      
      So rename the function to scrub_find_live_copy() and add extra comments
      to it.
      
      This lets us remove one parameter from scrub_extent(), and reduces the
      unnecessary calls to scrub_remap_extent() for regular replace.
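The idea of "find a live mirror" can be illustrated with a small userspace sketch. `struct mirror` and `find_live_copy()` are hypothetical names invented for this illustration; how the kernel actually resolves mirrors is not shown in the commit message and is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one mirror of an extent. */
struct mirror {
	bool dev_missing;	/* the backing device is absent */
	uint64_t physical;	/* physical offset on that device */
};

/*
 * Return the index of the first mirror whose device is present,
 * or -1 if every mirror sits on a missing device.
 */
static int find_live_copy(const struct mirror *mirrors, size_t nr)
{
	assert(mirrors != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (!mirrors[i].dev_missing)
			return (int)i;
	}
	return -1;
}
```

A caller would read from the returned mirror instead of the missing device, so no read error is counted against a device that is known to be absent.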
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Qu Wenruo's avatar
      btrfs: scrub: use find_first_extent_item for extent item search · d483bfd2
      Qu Wenruo authored
      Since we have find_first_extent_item() to iterate the extent items of a
      certain range, there is no need to use the open-coded version.
      
      Replace the final scrub call site with find_first_extent_item().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Qu Wenruo's avatar
      btrfs: scrub: refactor scrub_raid56_parity() · 9ae53bf9
      Qu Wenruo authored
      Currently scrub_raid56_parity() has a large double loop, handling the
      following things at the same time:
      
      - Iterate each data stripe
      - Iterate each extent item in one data stripe
      
      Refactor this by:
      
      - Introduce a new helper to handle data stripe iteration
        The new helper is scrub_raid56_data_stripe_for_parity(), which
        only has one while() loop handling the extent items inside the
        data stripe.
      
        The code is still mostly the same as the old code.
      
      - Call cond_resched() for each extent
        Previously we only called cond_resched() under a complex if () check.
        I see no special reason to do that, and other scrub functions, like
        scrub_simple_mirror(), already call cond_resched() after scrubbing
        each extent.
      
      - Add more comments
      
      Please note that this patch only addresses the double loop; there are
      incoming patches to do extra cleanup.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Qu Wenruo's avatar
      btrfs: scrub: use scrub_simple_mirror() to handle RAID56 data stripe scrub · 18d30ab9
      Qu Wenruo authored
      Although RAID56 has a complex repair mechanism, which involves reading
      the whole full stripe, inside one data stripe it's in fact no different
      than SINGLE/RAID1.
      
      The point here is that for a data stripe we just check the csum of each
      extent we hit.  Only in the csum mismatch case do the repair paths diverge.
      
      So we can still reuse scrub_simple_mirror() for RAID56 data stripes,
      which saves quite some code.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Qu Wenruo's avatar
      btrfs: scrub: cleanup the non-RAID56 branches in scrub_stripe() · e430c428
      Qu Wenruo authored
      Since we have moved the handling of all other profiles into their own
      functions, the main body of scrub_stripe() now only handles RAID56
      profiles.
      
      There is no need to address other profiles in the main loop of
      scrub_stripe(), so we can remove those dead branches.
      
      Since we're here, also slightly change the timing of initialization of
      variables like @offset, @increment and @logical.
      
      Especially for @logical, we don't really need to initialize it for
      btrfs_extent_root()/btrfs_csum_root(), we can use bg->start for that
      purpose.
      
      Now those variables are only initialized for RAID56 branches.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Qu Wenruo's avatar
      btrfs: scrub: introduce dedicated helper to scrub simple-stripe based range · 8557635e
      Qu Wenruo authored
      The new entry point will iterate through each data stripe which belongs
      to the target device.
      
      And since inside each data stripe, RAID0 is just SINGLE, while RAID10 is
      just RAID1, we can reuse scrub_simple_mirror() to do the scrub properly.
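To make "each data stripe which belongs to the target device" concrete, here is a sketch of the address arithmetic for a striped chunk. It assumes a conventional round-robin striping layout; `struct chunk_layout` and both helpers are hypothetical names for this illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical chunk description (not the kernel's mapping structure). */
struct chunk_layout {
	uint64_t start;		/* logical start of the block group */
	uint64_t stripe_len;	/* bytes per stripe */
	int num_stripes;	/* device stripes in the chunk */
	int sub_stripes;	/* 1 for RAID0, 2 for RAID10 */
};

/* Logical distance between two consecutive data stripes on one device. */
static uint64_t full_stripe_len(const struct chunk_layout *c)
{
	assert(c->sub_stripes > 0 && c->num_stripes % c->sub_stripes == 0);
	return (uint64_t)(c->num_stripes / c->sub_stripes) * c->stripe_len;
}

/* Logical address of the nth data stripe held by device stripe_index. */
static uint64_t data_stripe_logical(const struct chunk_layout *c,
				    int stripe_index, uint64_t nth)
{
	uint64_t first = c->start +
		(uint64_t)(stripe_index / c->sub_stripes) * c->stripe_len;

	return first + nth * full_stripe_len(c);
}
```

Under these assumptions, the two devices of a RAID10 mirror pair map to the same logical addresses, which is why each data stripe can be treated like a RAID1 range, and a RAID0 data stripe like a SINGLE range.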
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Qu Wenruo's avatar
      btrfs: scrub: introduce dedicated helper to scrub simple-mirror based range · 09022b14
      Qu Wenruo authored
      The new helper, scrub_simple_mirror(), will scrub all extents inside a
      range which only has simple mirror based duplication.
      
      This covers every range of SINGLE/DUP/RAID1/RAID1C*, and inside each
      data stripe for RAID0/RAID10.
      
      Currently we will use this function to scrub SINGLE/DUP/RAID1/RAID1C*
      profiles.  As one can see, the new entry point for those simple-mirror
      based profiles can be quite small (even with comments, it just reaches
      100 lines).
      
      This function will be the basis for the incoming scrub refactor.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Qu Wenruo's avatar
      btrfs: scrub: introduce a helper to locate an extent item · 416bd7e7
      Qu Wenruo authored
      The new helper, find_first_extent_item(), will locate an extent item
      (either EXTENT_ITEM or METADATA_ITEM) which covers any byte of the
      search range.
      
      This helper will later be used to refactor scrub code.
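The phrase "covers any byte of the search range" is an interval-overlap test. The sketch below shows that test over a plain sorted array; `struct extent` and `first_extent_in_range()` are hypothetical stand-ins, whereas the real helper searches the extent tree. It assumes non-empty ranges whose end does not overflow 64 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record; the array is assumed sorted by start. */
struct extent {
	uint64_t start;
	uint64_t len;
};

/* True if [a_start, a_start + a_len) and [b_start, b_start + b_len) share a byte. */
static bool ranges_overlap(uint64_t a_start, uint64_t a_len,
			   uint64_t b_start, uint64_t b_len)
{
	return a_start < b_start + b_len && b_start < a_start + a_len;
}

/* Return the first extent covering any byte of the search range, or NULL. */
static const struct extent *
first_extent_in_range(const struct extent *items, size_t nr,
		      uint64_t search_start, uint64_t search_len)
{
	assert(search_len > 0);

	for (size_t i = 0; i < nr; i++) {
		if (ranges_overlap(items[i].start, items[i].len,
				   search_start, search_len))
			return &items[i];
	}
	return NULL;
}
```

Note that an extent starting before the search range still matches as long as its tail reaches into the range.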
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Qu Wenruo's avatar
      btrfs: calculate physical_end using dev_extent_len directly in scrub_stripe() · 1194a824
      Qu Wenruo authored
      The variable @physical_end is the exclusive stripe end, currently it's
      calculated using @physical + @dev_extent_len / map->stripe_len *
       map->stripe_len.
      
      And since at allocation time we ensured dev_extent_len is stripe_len
      aligned, the result is the same as @physical + @dev_extent_len.
      
      So this patch will just assign @physical and @physical_end early,
      without using @nstripes.
      
      This is especially helpful for any possible out: label user, as now we
      only need to initialize @offset before going to out: label.
      
      Since we're here, also make @physical_end constant.
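The equivalence claimed above follows from integer division: dividing by stripe_len and multiplying back rounds down to a stripe_len boundary, which is a no-op for an aligned length. A small sketch, using hypothetical helper names, makes this checkable in userspace:

```c
#include <assert.h>
#include <stdint.h>

/* Old form: round dev_extent_len down to a multiple of stripe_len. */
static uint64_t physical_end_old(uint64_t physical, uint64_t dev_extent_len,
				 uint64_t stripe_len)
{
	assert(stripe_len != 0);
	return physical + dev_extent_len / stripe_len * stripe_len;
}

/* New form: valid only because dev_extent_len is stripe_len aligned. */
static uint64_t physical_end_new(uint64_t physical, uint64_t dev_extent_len)
{
	return physical + dev_extent_len;
}
```

The two forms differ only when dev_extent_len is not a multiple of stripe_len, a case the commit message says is ruled out at allocation time.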
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Gabriel Niebler's avatar
      btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray · 48b36a60
      Gabriel Niebler authored
      … rename it to simply fs_roots and adjust all usages of this object to use
      the XArray API, because it is notionally easier to use and understand, as
      it provides array semantics, and also takes care of locking for us,
      further simplifying the code.
      
      Also do some refactoring, esp. where the API change requires largely
      rewriting some functions, anyway.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>