1. 21 Jun, 2022 5 commits
    • btrfs: zoned: fix critical section of relocation inode writeback · 19ab78ca
      Naohiro Aota authored
      We use btrfs_zoned_data_reloc_{lock,unlock} to allow only one process to
      write out to the relocation inode. That critical section must include all
      the IO submission for the inode. However, flush_write_bio() in
      extent_writepages() is outside the critical section, causing an IO
      submission outside of the lock. This leads to out-of-order IO
      submission and fails the relocation process.
      
      Fix it by extending the critical section.
      
      Fixes: 35156d85 ("btrfs: zoned: only allow one process to add pages to a relocation inode")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: prevent allocation from previous data relocation BG · 343d8a30
      Naohiro Aota authored
      After commit 5f0addf7 ("btrfs: zoned: use dedicated lock for data
      relocation"), we observe IO errors on e.g. btrfs/232, like below.
      
        [09.0][T4038707] WARNING: CPU: 3 PID: 4038707 at fs/btrfs/extent-tree.c:2381 btrfs_cross_ref_exist+0xfc/0x120 [btrfs]
        <snip>
        [09.9][T4038707] Call Trace:
        [09.5][T4038707]  <TASK>
        [09.3][T4038707]  run_delalloc_nocow+0x7f1/0x11a0 [btrfs]
        [09.6][T4038707]  ? test_range_bit+0x174/0x320 [btrfs]
        [09.2][T4038707]  ? fallback_to_cow+0x980/0x980 [btrfs]
        [09.3][T4038707]  ? find_lock_delalloc_range+0x33e/0x3e0 [btrfs]
        [09.5][T4038707]  btrfs_run_delalloc_range+0x445/0x1320 [btrfs]
        [09.2][T4038707]  ? test_range_bit+0x320/0x320 [btrfs]
        [09.4][T4038707]  ? lock_downgrade+0x6a0/0x6a0
        [09.2][T4038707]  ? orc_find.part.0+0x1ed/0x300
        [09.5][T4038707]  ? __module_address.part.0+0x25/0x300
        [09.0][T4038707]  writepage_delalloc+0x159/0x310 [btrfs]
        <snip>
        [09.4][    C3] sd 10:0:1:0: [sde] tag#2620 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
        [09.5][    C3] sd 10:0:1:0: [sde] tag#2620 Sense Key : Illegal Request [current]
        [09.9][    C3] sd 10:0:1:0: [sde] tag#2620 Add. Sense: Unaligned write command
        [09.5][    C3] sd 10:0:1:0: [sde] tag#2620 CDB: Write(16) 8a 00 00 00 00 00 02 f3 63 87 00 00 00 2c 00 00
        [09.4][    C3] critical target error, dev sde, sector 396041272 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0
        [09.9][    C3] BTRFS error (device dm-1): bdev /dev/mapper/dml_102_2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
      
      The IO errors occur when we allocate a regular extent in the previous
      data relocation block group.
      
      On zoned btrfs, we use a dedicated block group to relocate a data
      extent. Thus, we allocate relocating data extents (pre-alloc) only from
      the dedicated block group and vice versa. Once the free space in the
      dedicated block group gets tight, a relocating extent may not fit into
      the block group. In that case, we need to switch the dedicated block
      group to the next one. Then, the previous one is now freed up for
      allocating a regular extent. The BG no longer has enough space for
      the relocating extent, but there is still room to allocate a smaller
      extent. Now the problem happens. By allocating a regular extent while
      nocow IOs for the relocation are still ongoing, we will issue WRITE IOs
      (for relocation) and ZONE APPEND IOs (for the regular writes) at the
      same time. These mixed IOs confuse the write pointer and cause the
      unaligned write errors.
      
      This commit introduces a new bit 'zoned_data_reloc_ongoing' to the
      btrfs_block_group. We set this bit before releasing the dedicated block
      group, and no extents are allocated from a block group having this bit
      set. This bit is similar to setting block_group->ro, but differs from
      it by still allowing nocow writes to start.
      
      Once all the nocow IO for relocation is done (hooked from
      btrfs_finish_ordered_io), we reset the bit to release the block group for
      further allocation.
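The intended semantics of the new bit can be illustrated with a small userspace sketch. This is not the kernel code: `struct bg_state` and the three helpers are hypothetical names, and the sketch only models the rules stated above (no allocation while the bit is set, nocow writes still allowed, bit cleared once relocation IO is done).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant block group state. */
struct bg_state {
    bool ro;                       /* block group is read-only */
    bool zoned_data_reloc_ongoing; /* relocation nocow IO still in flight */
};

/* New extents must not come from a read-only or still-relocating group. */
static bool bg_can_allocate(const struct bg_state *bg)
{
    return !bg->ro && !bg->zoned_data_reloc_ongoing;
}

/* Unlike ro, the new bit does not stop nocow writes from starting. */
static bool bg_can_start_nocow(const struct bg_state *bg)
{
    return !bg->ro;
}

/* Models the hook that runs once the last relocation nocow IO finishes. */
static void bg_reloc_io_done(struct bg_state *bg)
{
    bg->zoned_data_reloc_ongoing = false;
}
```

With the bit set, `bg_can_allocate()` is false while `bg_can_start_nocow()` stays true, which is the difference from `ro` that the message describes.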
      
      Fixes: c2707a25 ("btrfs: zoned: add a dedicated data relocation block group")
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON() on failure to migrate space when replacing extents · 650c9cab
      Filipe Manana authored
      At btrfs_replace_file_extents(), if we fail to migrate reserved metadata
      space from the transaction block reserve into the local block reserve,
      we trigger a BUG_ON(). This is because it should not be possible to have
      a failure here, as we reserved more space when we started the transaction
      than the space we want to migrate. However, having a BUG_ON() is way too
      drastic: we can handle the failure perfectly well and return the error
      to the caller. So just do that instead, and add a WARN_ON() to make it
      easier to notice the failure if it ever happens (which is particularly
      useful for fstests, as the warning will trigger a failure of a test case).
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add missing inode updates on each iteration when replacing extents · 983d8209
      Filipe Manana authored
      When replacing file extents, called during fallocate, hole punching,
      clone and deduplication, we may not be able to replace/drop all the
      target file extent items with a single transaction handle. We may get
      -ENOSPC while doing it, in which case we release the transaction handle,
      balance the dirty pages of the btree inode, flush delayed items and get
      a new transaction handle to operate on what's left of the target range.
      
      By dropping and replacing file extent items we have effectively modified
      the inode, so we should bump its iversion and update its mtime/ctime
      before we update the inode item. This is because if the transaction
      we used for partially modifying the inode gets committed by someone after
      we release it and before we finish the rest of the range, and a power
      failure then happens, after mounting the filesystem our inode has an
      outdated iversion and mtime/ctime, corresponding to the values it had
      before we changed it.
      
      So add the missing iversion and mtime/ctime updates.
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix race between reflinking and ordered extent completion · d4597898
      Filipe Manana authored
      While doing a reflink operation, if an ordered extent for a file range
      that does not overlap with the source and destination ranges of the
      reflink operation completes, we can end up having a failure in the
      reflink operation and return -EINVAL to user space.
      
      The following sequence of steps explains how this can happen:
      
      1) We have the page at file offset 315392 dirty (under delalloc);
      
      2) A reflink operation for this file starts, using the same file as both
         source and destination, the source range is [372736, 409600) (length of
         36864 bytes) and the destination range is [208896, 245760);
      
      3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source
         and destination ranges, and wait for any ordered extents in those
         ranges to complete;
      
      4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in
         the inode, but we wait neither for it nor for any ordered extents to
         complete. This results in starting delalloc for the page at file
         offset 315392 and creating an ordered extent for that single page
         range;
      
      5) We then move to btrfs_clone() and enter the loop to find file extent
         items to copy from the source range to destination range;
      
      6) In the first iteration we end up at the last file extent item stored
         in leaf A:
      
         (...)
         item 131 key (143616 108 315392) itemoff 5101 itemsize 53
                  extent data disk bytenr 1903988736 nr 73728
                  extent data offset 12288 nr 61440 ram 73728
      
         This represents the file range [315392, 376832), which overlaps with
         the source range to clone.
      
         @datal is set to 61440, key.offset is 315392 and @next_key_min_offset
         is therefore set to 376832 (315392 + 61440).
      
         @off (372736) is > key.offset (315392), so @new_key.offset is set to
         the value of @destoff (208896).
      
         @new_key.offset == @last_dest_end (208896) so @drop_start is set to
         208896 (@new_key.offset).
      
         @datal is adjusted to 4096, as @off is > @key.offset.
      
         So in this iteration we call btrfs_replace_file_extents() for the range
         [208896, 212991] (a single page, which is
         [@drop_start, @new_key.offset + @datal - 1]).
      
         @last_dest_end is set to 212992 (@new_key.offset + @datal =
         208896 + 4096 = 212992).
      
         Before the next iteration of the loop, @key.offset is set to the value
         376832, which is @next_key_min_offset;
      
      7) On the second iteration btrfs_search_slot() leaves us again at leaf A,
         but this time pointing beyond the last slot of leaf A, as that's where
         a key with offset 376832 would be if it existed. So we end up calling
         btrfs_next_leaf();
      
      8) btrfs_next_leaf() releases the path, but before it searches the tree
         again for the next key/leaf, the ordered extent for the single page
         range at file offset 315392 completes. That results in trimming the
         file extent item we processed before, adjusting its key offset from
         315392 to 319488, reducing its length from 61440 to 57344 and inserting
         a new file extent item for that single page range, with a key offset of
         315392 and a length of 4096.
      
         Leaf A now looks like:
      
           (...)
           item 132 key (143616 108 315392) itemoff 4995 itemsize 53
                    extent data disk bytenr 1801666560 nr 4096
                    extent data offset 0 nr 4096 ram 4096
           item 133 key (143616 108 319488) itemoff 4942 itemsize 53
                    extent data disk bytenr 1903988736 nr 73728
                    extent data offset 16384 nr 57344 ram 73728
      
      9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A
         at slot 133, since it's the first key that follows what was the last
         key we saw (143616 108 315392). In fact it's the same item we processed
         before, but its key offset was changed, so it counts as a new key;
      
      10) So now we have:
      
          @key.offset == 319488
          @datal == 57344
      
          @off (372736) is > key.offset (319488), so @new_key.offset is set to
          208896 (@destoff value).
      
          @new_key.offset (208896) != @last_dest_end (212992), so @drop_start
          is set to 212992 (@last_dest_end value).
      
          @datal is adjusted to 4096 because @off > @key.offset.
      
          So in this iteration we call btrfs_replace_file_extents() for the
          invalid range of [212992, 212991] (which is
          [@drop_start, @new_key.offset + @datal - 1]).
      
          This range is empty: the end offset is smaller than the start offset,
          so btrfs_replace_file_extents() returns -EINVAL, which we end up
          returning to user space, failing the reflink operation.
      
          This all happens because the range of this file extent item was
          already processed in the previous iteration.
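The arithmetic in steps 6 and 10 can be replayed in isolation. The following is a minimal userspace sketch, not the btrfs_clone() code: `struct drop_range`, `compute_drop_range()` and `range_is_empty()` are hypothetical names, only the case where @off is greater than @key.offset is modeled, and the @datal adjustment is assumed to be a subtraction of (@off - @key.offset), which reproduces the numbers quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive byte range passed to the (modeled) extent replacement. */
struct drop_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Hypothetical model of one loop iteration for the case off > key_offset.
 * last_dest_end is read and then advanced, as in steps 6 and 10.
 */
static struct drop_range compute_drop_range(uint64_t off, uint64_t destoff,
                                            uint64_t key_offset, uint64_t datal,
                                            uint64_t *last_dest_end)
{
    struct drop_range r;
    uint64_t new_key_offset = destoff; /* because off > key_offset */

    /* Assumed adjustment: skip the part of the extent before off. */
    datal -= off - key_offset;

    if (new_key_offset == *last_dest_end)
        r.start = new_key_offset;
    else
        r.start = *last_dest_end;
    r.end = new_key_offset + datal - 1;

    *last_dest_end = new_key_offset + datal;
    return r;
}

/* An inclusive range whose end precedes its start covers no bytes. */
static bool range_is_empty(struct drop_range r)
{
    return r.end < r.start;
}
```

Feeding it the first iteration (key offset 315392, length 61440) yields [208896, 212991]; feeding it the trimmed item (key offset 319488, length 57344) with @last_dest_end already at 212992 yields [212992, 212991], the empty range that triggers -EINVAL.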
      
      This scenario can be triggered very sporadically by fsx from fstests, for
      example with test case generic/522.
      
      So fix this by having btrfs_clone() skip file extent items that cover a
      file range that we have already processed.
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 07 Jun, 2022 1 commit
  3. 06 Jun, 2022 2 commits
    • btrfs: prevent remounting to v1 space cache for subpage mount · 0591f040
      Qu Wenruo authored
      Upstream commit 9f73f1ae ("btrfs: force v2 space cache usage for
      subpage mount") forces subpage mount to use v2 cache, to avoid
      deprecated v1 cache which doesn't support subpage properly.
      
      But there is a loophole: the user can still remount to v1 cache.

      The existing check only gives users a warning, and does not actually
      prevent the remount.

      Although remounting to v1 will not cause any problems since the v1 cache
      will always be marked invalid when mounted with a different page size,
      it's still better to prevent v1 cache entirely for subpage mounts.
      
      Fixes: 9f73f1ae ("btrfs: force v2 space cache usage for subpage mount")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix hang during unmount when block group reclaim task is running · 31e70e52
      Filipe Manana authored
      When we start an unmount, at close_ctree(), if we have the reclaim task
      running and in the middle of a data block group relocation, we can trigger
      a deadlock when stopping an async reclaim task, producing a trace like the
      following:
      
      [629724.498185] task:kworker/u16:7   state:D stack:    0 pid:681170 ppid:     2 flags:0x00004000
      [629724.499760] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
      [629724.501267] Call Trace:
      [629724.501759]  <TASK>
      [629724.502174]  __schedule+0x3cb/0xed0
      [629724.502842]  schedule+0x4e/0xb0
      [629724.503447]  btrfs_wait_on_delayed_iputs+0x7c/0xc0 [btrfs]
      [629724.504534]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [629724.505442]  flush_space+0x423/0x630 [btrfs]
      [629724.506296]  ? rcu_read_unlock_trace_special+0x20/0x50
      [629724.507259]  ? lock_release+0x220/0x4a0
      [629724.507932]  ? btrfs_get_alloc_profile+0xb3/0x290 [btrfs]
      [629724.508940]  ? do_raw_spin_unlock+0x4b/0xa0
      [629724.509688]  btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs]
      [629724.510922]  process_one_work+0x252/0x5a0
      [629724.511694]  ? process_one_work+0x5a0/0x5a0
      [629724.512508]  worker_thread+0x52/0x3b0
      [629724.513220]  ? process_one_work+0x5a0/0x5a0
      [629724.514021]  kthread+0xf2/0x120
      [629724.514627]  ? kthread_complete_and_exit+0x20/0x20
      [629724.515526]  ret_from_fork+0x22/0x30
      [629724.516236]  </TASK>
      [629724.516694] task:umount          state:D stack:    0 pid:719055 ppid:695412 flags:0x00004000
      [629724.518269] Call Trace:
      [629724.518746]  <TASK>
      [629724.519160]  __schedule+0x3cb/0xed0
      [629724.519835]  schedule+0x4e/0xb0
      [629724.520467]  schedule_timeout+0xed/0x130
      [629724.521221]  ? lock_release+0x220/0x4a0
      [629724.521946]  ? lock_acquired+0x19c/0x420
      [629724.522662]  ? trace_hardirqs_on+0x1b/0xe0
      [629724.523411]  __wait_for_common+0xaf/0x1f0
      [629724.524189]  ? usleep_range_state+0xb0/0xb0
      [629724.524997]  __flush_work+0x26d/0x530
      [629724.525698]  ? flush_workqueue_prep_pwqs+0x140/0x140
      [629724.526580]  ? lock_acquire+0x1a0/0x310
      [629724.527324]  __cancel_work_timer+0x137/0x1c0
      [629724.528190]  close_ctree+0xfd/0x531 [btrfs]
      [629724.529000]  ? evict_inodes+0x166/0x1c0
      [629724.529510]  generic_shutdown_super+0x74/0x120
      [629724.530103]  kill_anon_super+0x14/0x30
      [629724.530611]  btrfs_kill_super+0x12/0x20 [btrfs]
      [629724.531246]  deactivate_locked_super+0x31/0xa0
      [629724.531817]  cleanup_mnt+0x147/0x1c0
      [629724.532319]  task_work_run+0x5c/0xa0
      [629724.532984]  exit_to_user_mode_prepare+0x1a6/0x1b0
      [629724.533598]  syscall_exit_to_user_mode+0x16/0x40
      [629724.534200]  do_syscall_64+0x48/0x90
      [629724.534667]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [629724.535318] RIP: 0033:0x7fa2b90437a7
      [629724.535804] RSP: 002b:00007ffe0b7e4458 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [629724.536912] RAX: 0000000000000000 RBX: 00007fa2b9182264 RCX: 00007fa2b90437a7
      [629724.538156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000555d6cf20dd0
      [629724.539053] RBP: 0000555d6cf20ba0 R08: 0000000000000000 R09: 00007ffe0b7e3200
      [629724.539956] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      [629724.540883] R13: 0000555d6cf20dd0 R14: 0000555d6cf20cb0 R15: 0000000000000000
      [629724.541796]  </TASK>
      
      This happens because:
      
      1) Before entering close_ctree() we have the async block group reclaim
         task running and relocating a data block group;
      
      2) There's an async metadata (or data) space reclaim task running;
      
      3) We enter close_ctree() and park the cleaner kthread;
      
      4) The async space reclaim task is at flush_space() and runs all the
         existing delayed iputs;
      
      5) Before the async space reclaim task calls
         btrfs_wait_on_delayed_iputs(), the block group reclaim task which is
         doing the data block group relocation, creates a delayed iput at
         replace_file_extents() (called when COWing leaves that have file extent
         items pointing to relocated data extents, during the merging phase
         of relocation roots);
      
      6) The async space reclaim task blocks at
         btrfs_wait_on_delayed_iputs(), since we have a new delayed iput;
      
      7) The task at close_ctree() then calls cancel_work_sync() to stop the
         async space reclaim task, but it blocks since that task is waiting for
         the delayed iput to be run;
      
      8) The delayed iput is never run because the cleaner kthread is parked,
         and no one else runs delayed iputs, resulting in a hang.
      
      So fix this by stopping the async block group reclaim task before we
      park the cleaner kthread.
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 17 May, 2022 6 commits
    • btrfs: zoned: introduce a minimal zone size 4M and reject mount · 0a05fafe
      Johannes Thumshirn authored
      ZNS SSDs are expected to have zone sizes in the range of 1-2GB and SMR
      HDDs have zone sizes of 256MB, so there is no need to allow arbitrarily
      small zone sizes on btrfs.

      But for testing purposes with emulated devices it is sometimes desirable
      to create devices with zone sizes as small as 4MB to uncover errors.
      
      So use 4MB as the smallest possible zone size and reject mounts of devices
      with a smaller zone size.
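The check described here amounts to a single lower-bound comparison. A minimal sketch follows, assuming a hypothetical macro `MIN_ZONE_SIZE_SKETCH` and helper `check_zone_size()`; the -EINVAL return value is also an assumption, since the message only says the mount is rejected.

```c
#include <errno.h>
#include <stdint.h>

/* 4M lower bound from the commit message; the macro name is hypothetical. */
#define MIN_ZONE_SIZE_SKETCH (4ULL * 1024 * 1024)

/*
 * Returns 0 if the zone size is acceptable, or -EINVAL (assumed error
 * code) if the device should be rejected at mount time.
 */
static int check_zone_size(uint64_t zone_size)
{
    if (zone_size < MIN_ZONE_SIZE_SKETCH)
        return -EINVAL;
    return 0;
}
```

A 256MB SMR zone and a 4MB emulated zone both pass, while anything below 4MB is refused.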
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow defrag to convert inline extents to regular extents · d8101a0c
      Qu Wenruo authored
      Btrfs defaults to max_inline=2K to make small writes inlined into
      metadata.
      
      The default value used to always be a win: even though DUP/RAID1/RAID10
      doubles the metadata usage, it should still use less physical space
      than a 4K regular extent.

      But since the introduction of RAID1C3 and RAID1C4 that's no longer the
      case. Users may find inlined extents wasting too much space, and want
      to convert those inlined extents back to regular extents.

      Unfortunately defrag will unconditionally skip all inline extents, even
      if the user is trying to convert them back to regular extents.
      
      So this patch will add a small exception for defrag_collect_targets() to
      allow defragging inline extents, if and only if the inlined extents are
      larger than max_inline, allowing users to convert them to regular ones.
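The exception can be pictured as a one-line predicate. This is only a sketch: `defrag_skips_inline_extent()` is a hypothetical name, and which size is compared against max_inline (here, the length of the file range the inline extent covers) is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: an inline extent is still skipped by defrag
 * unless the range it covers is larger than the current max_inline.
 */
static bool defrag_skips_inline_extent(uint64_t extent_len, uint64_t max_inline)
{
    return extent_len <= max_inline;
}
```

Under this reading, an inline extent covering 4096 bytes with max_inline=2K is no longer skipped, while one covering 1024 bytes still is.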
      
      This also allows us to defrag extents like the following:
      
      	item 6 key (257 EXTENT_DATA 0) itemoff 15794 itemsize 69
      		generation 7 type 0 (inline)
      		inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15741 itemsize 53
      		generation 7 type 1 (regular)
      		extent data disk byte 13631488 nr 4096
      		extent data offset 0 nr 16384 ram 16384
      		extent compression 1 (zlib)
      
      Previously we were unable to do any defrag, since the first extent is
      inlined, and the second one has no extent to merge with.

      Now we can defrag it to just one single extent, saving 48 bytes of
      metadata space.
      
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		generation 8 type 1 (regular)
      		extent data disk byte 13635584 nr 4096
      		extent data offset 0 nr 20480 ram 20480
      		extent compression 1 (zlib)
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add "0x" prefix for unsupported optional features · d5321a0f
      Qu Wenruo authored
      The following error message obviously lacks the "0x" prefix:
      
        cannot mount because of unsupported optional features (4000)
      
      Add the prefix to make it less confusing. This can happen on older
      kernels that try to mount a filesystem with newer features so it makes
      sense to backport to older trees.
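The effect of the change can be reproduced in userspace with a plain format string. In this sketch `format_unsupported_features()` is a hypothetical helper; the message text is taken from the example above, and the kernel's own logging call is not modeled.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: formats the feature mask in hex with an explicit
 * "0x" prefix so it cannot be mistaken for a decimal number.
 */
static int format_unsupported_features(char *buf, size_t len,
                                       unsigned long long features)
{
    return snprintf(buf, len,
                    "cannot mount because of unsupported optional features (0x%llx)",
                    features);
}
```

For the mask from the example, the formatted message ends in `(0x4000)` instead of `(4000)`.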
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not account twice for inode ref when reserving metadata units · 97bdf1a9
      Filipe Manana authored
      When reserving metadata units for creating an inode, we don't need to
      reserve one extra unit for the inode ref item because when creating the
      inode, at btrfs_create_new_inode(), we always insert the inode item and
      the inode ref item in a single batch (a single btree insert operation,
      and both ending up in the same leaf).
      
      As we have already accounted one unit for the inode item, the extra unit
      for the inode ref item is superfluous: it only makes us reserve more
      metadata than necessary, often adding more reclaim pressure if we are
      low on available metadata space.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer · aa9ffadf
      Naohiro Aota authored
      The block_group->alloc_offset is an offset from the start of the block
      group. OTOH, the ->meta_write_pointer is an address in the logical
      space. So, we should compare the alloc_offset shifted by
      block_group->start.
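The unit mismatch is easy to see with concrete numbers. The sketch below uses a hypothetical `struct bg_sketch` with only the three fields named above; the direction of the comparison (checking whether allocated metadata lies beyond the write pointer) is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of the block group to the fields discussed. */
struct bg_sketch {
    uint64_t start;              /* logical address of the block group */
    uint64_t alloc_offset;       /* offset relative to start */
    uint64_t meta_write_pointer; /* logical address */
};

/* Buggy form: compares a relative offset with a logical address. */
static bool has_unwritten_meta_buggy(const struct bg_sketch *bg)
{
    return bg->alloc_offset > bg->meta_write_pointer;
}

/* Fixed form: both sides are logical addresses. */
static bool has_unwritten_meta_fixed(const struct bg_sketch *bg)
{
    return bg->start + bg->alloc_offset > bg->meta_write_pointer;
}
```

For a block group starting at 1GiB with 64KiB allocated and the write pointer 16KiB into the group, the buggy form reports nothing pending, while the fixed form correctly reports that allocated space lies beyond the write pointer.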
      
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: avoid trashing the page cache · 152555b3
      Filipe Manana authored
      A send operation reads extent data using the buffered IO path for getting
      extent data to send in write commands and this is both because it's simple
      and to make use of the generic readahead infrastructure, which results in
      a massive speedup.
      
      However this fills the page cache with data that, most of the time, is
      really only used by the send operation - once the write commands are sent,
      it's not useful to have the data in the page cache anymore. For large
      snapshots, bringing all data into the page cache eventually leads to the
      need to evict other data from the page cache that may be more useful for
      applications (and kernel subsystems).
      
      Even if extents are shared with the subvolume on which a snapshot is
      based and the data is currently in the page cache due to being read
      through the subvolume, attempting to read the data through the snapshot
      will always result in bringing a new copy of the data into another
      location in the page cache (there's currently no shared memory for
      shared extents).
      
      So make send evict the data it has read if, when it first opened the
      inode, its mapping had no pages loaded, that is, when
      inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
      based on the return value of filemap_range_has_page() before reading an
      extent because the generic readahead mechanism may read pages beyond the
      range we request (and it very often does it), which means a call to
      filemap_range_has_page() will return true due to the readahead that was
      triggered when processing a previous extent - we don't have a simple way
      to distinguish this case from the case where the data was brought into
      the page cache through someone else. So checking for the mapping number
      of pages being 0 when we first open the inode is simple, cheap and it
      generally accomplishes the goal of not trashing the page cache - the
      only exception is if part of data was previously loaded into the page
      cache through the snapshot by some other process, in that case we end
      up not evicting any data send brings into the page cache, just like
      before this change - but that however is not the common case.
      
      Example scenario, on a box with 32G of RAM:
      
        $ btrfs subvolume create /mnt/sv1
        $ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
      
        $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
      
        $ free -m
                       total        used        free      shared  buff/cache   available
        Mem:           31937         186       26866           0        4883       31297
        Swap:           8188           0        8188
      
        # After this we have 4G less free memory.
        $ btrfs send /mnt/snap1 >/dev/null
      
        $ free -m
                       total        used        free      shared  buff/cache   available
        Mem:           31937         186       22814           0        8935       31297
        Swap:           8188           0        8188
      
      The same, obviously, applies to an incremental send.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 16 May, 2022 26 commits
    • btrfs: send: keep the current inode open while processing it · 521b6803
      Filipe Manana authored
      Every time we send a write command, we open the inode, read some data to
      a buffer and then close the inode. The amount of data we read for each
      write command is at most 48K, returned by max_send_read_size(), and that
      corresponds to: BTRFS_SEND_BUF_SIZE - 16K = 48K. In practice this does
      not add any significant overhead, because the time elapsed between every
      close (iput()) and open (btrfs_iget()) is very short, so the inode is kept
      in the VFS's cache after the iput() and it's still there by the time we
      do the next btrfs_iget().
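To get a feel for how many open/close cycles this involves, the 48K limit can be turned into a count of write commands per file. The sketch below is illustrative only: the macro and function names are hypothetical, and the 64K buffer size is inferred from the "BTRFS_SEND_BUF_SIZE - 16K = 48K" arithmetic above.

```c
#include <stdint.h>

/* Assumed from the text: a 64K send buffer minus 16K leaves 48K per read. */
#define SEND_BUF_SIZE_SKETCH (64ULL * 1024)
#define SEND_MAX_READ_SKETCH (SEND_BUF_SIZE_SKETCH - 16ULL * 1024)

/*
 * Number of write commands (and thus, before this change, inode
 * open/close pairs) needed to send 'bytes' of file data.
 */
static uint64_t write_commands_for(uint64_t bytes)
{
    return (bytes + SEND_MAX_READ_SKETCH - 1) / SEND_MAX_READ_SKETCH;
}
```

Under these assumptions a 4G file needs 87382 write commands, so keeping the inode open replaces that many btrfs_iget()/iput() pairs with a single one.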
      
      As between processing extents of the current inode we don't do anything
      else, it makes sense to keep the inode open after we process its first
      extent that needs to be sent and keep it open until we start processing
      the next inode. This serves to facilitate the next change, which aims
      to avoid having send operations trash the page cache with data extents.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate the btrfs_dio_private as part of the iomap dio bio · 642c5d34
      Christoph Hellwig authored
      Create a new bio_set that contains all the per-bio private data needed
      by btrfs for direct I/O and tell the iomap code to use that instead
      of separately allocating the btrfs_dio_private structure.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move struct btrfs_dio_private to inode.c · a3e171a0
      Christoph Hellwig authored
      The btrfs_dio_private structure is only used in inode.c, so move the
      definition there.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the disk_bytenr in struct btrfs_dio_private · acb8b52a
      Christoph Hellwig authored
      This field is never used, so remove it. Last use was probably in
      23ea8e5a ("Btrfs: load checksum data once when submitting a direct
      read io").
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate dio_data on stack · 491a6d01
      Christoph Hellwig authored
      Make use of the new iomap_iter->private field to avoid a memory
      allocation per iomap range.
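
      A minimal userspace sketch of the idea, not the kernel code: an iterator
      carries an opaque private pointer, so the caller can keep its state in a
      stack variable instead of allocating it for each range. All demo_* names
      are hypothetical stand-ins for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for an iterator that carries caller-owned state. */
struct demo_iter {
	size_t pos;      /* current position in the request */
	size_t len;      /* total request length */
	void *private;   /* opaque pointer owned by the caller */
};

/* Hypothetical per-request state that would otherwise be heap-allocated. */
struct demo_dio_data {
	size_t reserved;   /* bytes accounted so far */
	int ranges_seen;   /* number of ranges processed */
};

/* Called once per range; finds its state through iter->private. */
static void demo_range_begin(struct demo_iter *iter, size_t range_len)
{
	struct demo_dio_data *data = iter->private;

	data->reserved += range_len;
	data->ranges_seen++;
	iter->pos += range_len;
}

/* Walk a request of len bytes in chunks of at most max_range bytes. */
static int demo_dio_rw(size_t len, size_t max_range, struct demo_dio_data *out)
{
	struct demo_dio_data data = { 0 };   /* lives on the stack */
	struct demo_iter iter = { .pos = 0, .len = len, .private = &data };

	if (max_range == 0)
		return -1;

	while (iter.pos < iter.len) {
		size_t n = iter.len - iter.pos;

		if (n > max_range)
			n = max_range;
		demo_range_begin(&iter, n);
	}
	*out = data;
	return 0;
}
```

      The same state object is visible to every range callback, and no
      allocation or free is needed on the per-range path.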
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      491a6d01
    • Christoph Hellwig's avatar
      iomap: add per-iomap_iter private data · 786f847f
      Christoph Hellwig authored
      Allow the file system to keep state for all iterations.  For now only
      wire it up for direct I/O as there is an immediate need for it there.
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      786f847f
    • Christoph Hellwig's avatar
      iomap: allow the file system to provide a bio_set for direct I/O · 908c5490
      Christoph Hellwig authored
      Allow the file system to provide a specific bio_set for allocating
      direct I/O bios.  This will allow file systems that use the
      ->submit_io hook to stash away additional information for file system
      use.
      
      To make use of this additional space for information in the completion
      path, the file system needs to override the ->bi_end_io callback and
      then call back into iomap, so export iomap_dio_bio_end_io for that.
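
      The pattern described above can be sketched in plain userspace C. This is
      an illustration only: the demo_* types and functions are hypothetical and
      much simpler than the kernel's bio and bio_set. The private data and the
      bio are one allocation, the file system installs its own completion
      callback, and that callback chains to the generic handler.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for a bio. */
struct demo_bio {
	int status;
	void (*end_io)(struct demo_bio *bio);
};

/* Filesystem-private data allocated together with the bio. */
struct demo_dio_private {
	unsigned long long file_offset;
	int fs_completed;
	struct demo_bio bio;   /* embedded, so one allocation covers both */
};

#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stands in for the generic completion handler exported by the core code. */
static void demo_generic_end_io(struct demo_bio *bio)
{
	bio->status = 0;
}

/* Filesystem override: recover the private data, then chain to the generic handler. */
static void demo_fs_end_io(struct demo_bio *bio)
{
	struct demo_dio_private *dip =
		demo_container_of(bio, struct demo_dio_private, bio);

	dip->fs_completed = 1;
	demo_generic_end_io(bio);
}

/* Allocate the bio and its private data in one go. */
static struct demo_bio *demo_alloc_bio(unsigned long long file_offset)
{
	struct demo_dio_private *dip = calloc(1, sizeof(*dip));

	if (!dip)
		return NULL;
	dip->file_offset = file_offset;
	dip->bio.status = -1;              /* not completed yet */
	dip->bio.end_io = demo_fs_end_io;
	return &dip->bio;
}

static void demo_free_bio(struct demo_bio *bio)
{
	free(demo_container_of(bio, struct demo_dio_private, bio));
}
```

      Because the bio is embedded in the private structure, the completion
      path needs no lookup and no second allocation to reach its own data.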
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      908c5490
    • Christoph Hellwig's avatar
      btrfs: add a btrfs_dio_rw wrapper · 36e8c622
      Christoph Hellwig authored
      Add a wrapper around iomap_dio_rw that keeps the direct I/O internals
      isolated in inode.c.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      36e8c622
    • Naohiro Aota's avatar
      btrfs: zoned: zone finish unused block group · 74e91b12
      Naohiro Aota authored
      While the active zones within an active block group are reset, and their
      active resource is released, the block group itself is kept in the active
      block group list and marked as active. As a result, the list will contain
      more than max_active_zones block groups. That itself is not fatal for the
      device as the zones are properly reset.
      
      However, the inflated list is misleading. Also, an upcoming patch
      series, which deactivates an active block group on demand, gets
      confused by the wrong list.
      
      So, fix the issue by finishing the unused block group once it becomes
      read-only, so that we can release the active resource early.
      
      Fixes: be1a1d7a ("btrfs: zoned: finish fully written block group")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      74e91b12
    • Naohiro Aota's avatar
      btrfs: zoned: properly finish block group on metadata write · 56fbb0a4
      Naohiro Aota authored
      Commit be1a1d7a ("btrfs: zoned: finish fully written block group")
      introduced zone finishing code for both the data and metadata end_io
      paths. However, the metadata side does not work as it should. First, in
      submit_eb_page() it compares a logical address (eb->start + eb->len)
      with an offset within a block group (cache->zone_capacity). That
      essentially disables zone finishing on the metadata end_io path.
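
      A small sketch of why the mismatched comparison never fires, assuming a
      simplified block group with only a start address and a zone capacity.
      The demo_* names and the use of an equality test are assumptions for
      illustration, not the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified block group. */
struct demo_block_group {
	uint64_t start;          /* logical address where the group begins */
	uint64_t zone_capacity;  /* usable bytes, an offset limit within the group */
};

/* Mismatched units: a logical address is compared with an in-group offset. */
static bool demo_is_last_write_buggy(const struct demo_block_group *bg,
				     uint64_t eb_start, uint64_t eb_len)
{
	return eb_start + eb_len == bg->zone_capacity;
}

/* Consistent units: convert the logical end into an offset within the group first. */
static bool demo_is_last_write(const struct demo_block_group *bg,
			       uint64_t eb_start, uint64_t eb_len)
{
	return (eb_start + eb_len) - bg->start == bg->zone_capacity;
}
```

      For any block group that does not start at logical address 0, the
      mismatched form is false for a write that ends exactly at the capacity.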
      
      Furthermore, fixing the issue above revealed that we cannot call
      btrfs_zone_finish_endio() in end_extent_buffer_writeback(). We cannot
      call btrfs_lookup_block_group(), which requires a spin lock, inside
      end_io context.
      
      Introduce btrfs_schedule_zone_finish_bg() to wait for the extent buffer
      writeback and do the zone finish IO in a workqueue.
      
      Also, drop EXTENT_BUFFER_ZONE_FINISH as it is no longer used.
      
      Fixes: be1a1d7a ("btrfs: zoned: finish fully written block group")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      56fbb0a4
    • Naohiro Aota's avatar
      btrfs: zoned: finish block group when there are no more allocatable bytes left · 8b8a5399
      Naohiro Aota authored
      Currently, btrfs_zone_finish_endio() finishes a block group only when the
      written region reaches the end of the block group. We can also finish the
      block group when no more allocation is possible.
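
      One way to express "no more allocation is possible" is to check whether
      the smallest possible allocation still fits behind the written region.
      The sketch below assumes a caller-supplied minimum allocation size (for
      example a sector or node size); the demo_* names are hypothetical and the
      exact kernel condition may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified zoned block group. */
struct demo_zone_bg {
	uint64_t start;          /* logical start of the block group */
	uint64_t zone_capacity;  /* usable bytes in the block group */
};

/*
 * Return true when the block group can be finished: either the write
 * reached the end of the capacity, or the remaining space is too small
 * for even one minimum-sized allocation.
 */
static bool demo_should_finish(const struct demo_zone_bg *bg,
			       uint64_t written_end, uint64_t min_alloc_bytes)
{
	uint64_t capacity_end = bg->start + bg->zone_capacity;

	return written_end + min_alloc_bytes > capacity_end;
}
```

      With a capacity of 1000 bytes and a 16 byte minimum, a write ending at
      990 already allows finishing, while one ending at 984 does not.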
      
      Fixes: be1a1d7a ("btrfs: zoned: finish fully written block group")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8b8a5399
    • Naohiro Aota's avatar
      btrfs: zoned: consolidate zone finish functions · d70cbdda
      Naohiro Aota authored
      btrfs_zone_finish() and btrfs_zone_finish_endio() have similar code.
      Introduce do_zone_finish() to factor out the common code.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d70cbdda
    • Naohiro Aota's avatar
      btrfs: zoned: introduce btrfs_zoned_bg_is_full · 1bfd4767
      Naohiro Aota authored
      Introduce a wrapper to check if all the space in a block group is
      allocated or not.
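
      Such a wrapper could look roughly like the following userspace sketch,
      assuming the block group tracks an allocation offset and a zone
      capacity. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified block group on a zoned device. */
struct demo_bg {
	uint64_t alloc_offset;   /* next byte to allocate, relative to the group start */
	uint64_t zone_capacity;  /* usable bytes in the group */
};

/* True when every usable byte of the block group has been allocated. */
static bool demo_bg_is_full(const struct demo_bg *bg)
{
	/* Sequential allocation never moves past the capacity. */
	assert(bg->alloc_offset <= bg->zone_capacity);
	return bg->alloc_offset == bg->zone_capacity;
}
```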
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1bfd4767
    • Nikolay Borisov's avatar
      btrfs: improve error reporting in lookup_inline_extent_backref · cf4f03c3
      Nikolay Borisov authored
      When iterating the backrefs in an extent item, if the ptr to the
      'current' backref record goes beyond the extent item, a warning is
      generated and -ENOENT is returned. However, what's more appropriate for
      debugging such cases is to return -EUCLEAN and also print identifying
      information about the performed search as well as the current content of
      the leaf containing the possibly corrupted extent item.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cf4f03c3
    • David Sterba's avatar
      btrfs: rename bio_ctrl::bio_flags to compress_type · 0f07003b
      David Sterba authored
      The bio_ctrl is the last user of bio_flags, which has been converted to
      compress type everywhere else.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0f07003b
    • David Sterba's avatar
      btrfs: rename bio_flags in parameters and switch type · cb3a12d9
      David Sterba authored
      Several functions take a bio_flags parameter that was simplified to just
      the compress type. Rename it and change the type accordingly.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cb3a12d9
    • David Sterba's avatar
      btrfs: rename io_failure_record::bio_flags to compress_type · 0ff40013
      David Sterba authored
      The bio_flags member now stores the unchanged compress type, so rename
      it to match.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0ff40013
    • David Sterba's avatar
      btrfs: open code extent_set_compress_type helpers · 7f6ca7f2
      David Sterba authored
      The helpers extent_set_compress_type and extent_compress_type have
      become trivial after previous cleanups and can be removed.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7f6ca7f2
    • David Sterba's avatar
      btrfs: simplify handling of bio_ctrl::bio_flags · 2a5232a8
      David Sterba authored
      The bio_flags are used only to encode the compression and there are no
      other EXTENT_BIO_* flags, so the compress type can be stored directly.
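
      To illustrate the simplification, the sketch below contrasts a packed
      encoding (a "compressed" flag bit plus a shifted type) with storing the
      type directly. The flag bit, the shift value and all DEMO_/demo_ names
      are illustrative assumptions, not copied from the kernel headers.

```c
/* Illustrative values only; assumed for this sketch. */
#define DEMO_BIO_COMPRESSED 1UL
#define DEMO_BIO_FLAG_SHIFT 16

/* Packed form: one flag bit plus the compress type in the upper bits. */
static unsigned long demo_encode_flags(int compress_type)
{
	if (compress_type == 0)
		return 0;   /* no compression, no flag */
	return DEMO_BIO_COMPRESSED |
	       ((unsigned long)compress_type << DEMO_BIO_FLAG_SHIFT);
}

/* Recover the compress type from the packed form. */
static int demo_decode_flags(unsigned long bio_flags)
{
	return (int)(bio_flags >> DEMO_BIO_FLAG_SHIFT);
}

/* Direct form: once no other flags exist, the type can be stored as is. */
struct demo_bio_ctrl {
	int compress_type;
};
```

      When the only information carried is the compress type, the encode and
      decode steps add nothing, which is why storing the value directly is
      simpler.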
      The struct member name is left unchanged and will be cleaned in later
      patches.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2a5232a8
    • David Sterba's avatar
      btrfs: remove trivial helper update_nr_written · 572f3dad
      David Sterba authored
      The helper used to do more with the wbc state but now it's just one
      subtraction, so there is no need for a special helper.
      
      It became trivial in a9132667 ("Btrfs: make mapping->writeback_index
      point to the last written page").
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      572f3dad
    • David Sterba's avatar
      btrfs: remove btrfs_delayed_extent_op::is_data · 0e3696f8
      David Sterba authored
      The value of btrfs_delayed_extent_op::is_data is always false, so we can
      cascade the change and simplify the code that depends on it, eventually
      removing the structure member.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0e3696f8
    • David Sterba's avatar
      btrfs: sink parameter is_data to btrfs_set_disk_extent_flags · 2fe6a5a1
      David Sterba authored
      The parameter was added in 2009 in the infamous monster commit
      5d4f98a2 ("Btrfs: Mixed back reference  (FORWARD ROLLING FORMAT
      CHANGE)") but has not been used since. We can sink it and allow further
      simplifications.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2fe6a5a1
    • Filipe Manana's avatar
      btrfs: fix deadlock between concurrent dio writes when low on free data space · f5585f4f
      Filipe Manana authored
      When reserving data space for a direct IO write we can end up deadlocking
      if we have multiple tasks attempting a write to the same file range, there
      are multiple extents covered by that file range, we are low on available
      space for data and the writes don't expand the inode's i_size.
      
      The deadlock can happen like this:
      
      1) We have a file with an i_size of 1M, at offset 0 it has an extent with
         a size of 128K and at offset 128K it has another extent also with a
         size of 128K;
      
      2) Task A does a direct IO write against file range [0, 256K), and because
         the write is within the i_size boundary, it takes the inode's lock (VFS
         level) in shared mode;
      
      3) Task A locks the file range [0, 256K) at btrfs_dio_iomap_begin(), and
         then gets the extent map for the extent covering the range [0, 128K).
         At btrfs_get_blocks_direct_write(), it creates an ordered extent for
         that file range ([0, 128K));
      
      4) Before returning from btrfs_dio_iomap_begin(), it unlocks the file
         range [0, 256K);
      
      5) Task A executes btrfs_dio_iomap_begin() again, this time for the file
         range [128K, 256K), and locks the file range [128K, 256K);
      
      6) Task B starts a direct IO write against file range [0, 256K) as well.
         It also locks the inode in shared mode, as it's within the i_size limit,
         and then tries to lock file range [0, 256K). It is able to lock the
         subrange [0, 128K) but then blocks waiting for the range [128K, 256K),
         as it is currently locked by task A;
      
      7) Task A enters btrfs_get_blocks_direct_write() and tries to reserve data
         space. Because we are low on available free space, it triggers the
         async data reclaim task, and waits for it to reserve data space;
      
      8) The async reclaim task decides to wait for all existing ordered extents
         to complete (through btrfs_wait_ordered_roots()).
         It finds the ordered extent previously created by task A for the file
         range [0, 128K) and waits for it to complete;
      
      9) The ordered extent for the file range [0, 128K) can not complete
         because it blocks at btrfs_finish_ordered_io() when trying to lock the
         file range [0, 128K).
      
         This results in a deadlock, because:
      
         - task B is holding the file range [0, 128K) locked, waiting for the
           range [128K, 256K) to be unlocked by task A;
      
         - task A is holding the file range [128K, 256K) locked and it's waiting
           for the async data reclaim task to satisfy its space reservation
           request;
      
         - the async data reclaim task is waiting for ordered extent [0, 128K)
           to complete, but the ordered extent can not complete because the
           file range [0, 128K) is currently locked by task B, which is waiting
           on task A to unlock file range [128K, 256K), while task A is waiting
           on the async data reclaim task.
      
         This results in a deadlock between 4 tasks: task A, task B, the async
         data reclaim task and the task doing ordered extent completion (a work
         queue task).
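
      The circular wait described in the steps above can be modelled as a
      wait-for graph in which each task waits on at most one other task. The
      following userspace sketch uses hypothetical demo_* names and only
      mirrors the scenario; it is not how the kernel detects hung tasks.

```c
#include <stdbool.h>

/* The four tasks involved in the scenario. */
enum demo_task {
	DEMO_TASK_A,      /* dio writer holding range [128K, 256K) */
	DEMO_TASK_B,      /* dio writer holding range [0, 128K) */
	DEMO_RECLAIM,     /* async data reclaim task */
	DEMO_OE_WORKER,   /* ordered extent completion worker */
	DEMO_NR_TASKS
};

#define DEMO_NOT_WAITING (-1)

/* Fill in who waits on whom, following the description in the text. */
static void demo_fill_scenario(int waits_for[DEMO_NR_TASKS])
{
	waits_for[DEMO_TASK_A] = DEMO_RECLAIM;      /* waits for space reservation */
	waits_for[DEMO_RECLAIM] = DEMO_OE_WORKER;   /* waits for ordered extent [0, 128K) */
	waits_for[DEMO_OE_WORKER] = DEMO_TASK_B;    /* needs range [0, 128K) */
	waits_for[DEMO_TASK_B] = DEMO_TASK_A;       /* needs range [128K, 256K) */
}

/* Return true if following the wait edges from start leads back to start. */
static bool demo_in_wait_cycle(const int waits_for[], int nr_tasks, int start)
{
	int cur = start;

	for (int steps = 0; steps < nr_tasks; steps++) {
		cur = waits_for[cur];
		if (cur == DEMO_NOT_WAITING)
			return false;   /* the chain ends, so no cycle through start */
		if (cur == start)
			return true;    /* came back to where we began */
	}
	return false;
}
```

      Removing any single edge breaks the cycle. For example, if task A does
      not wait for the reclaim task while holding its range lock, the chain
      starting at task B ends at task A.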
      
      This type of deadlock can sporadically be triggered by the test case
      generic/300 from fstests, and results in a stack trace like the following:
      
      [12084.033689] INFO: task kworker/u16:7:123749 blocked for more than 241 seconds.
      [12084.034877]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.035562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.036548] task:kworker/u16:7   state:D stack:    0 pid:123749 ppid:     2 flags:0x00004000
      [12084.036554] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
      [12084.036599] Call Trace:
      [12084.036601]  <TASK>
      [12084.036606]  __schedule+0x3cb/0xed0
      [12084.036616]  schedule+0x4e/0xb0
      [12084.036620]  btrfs_start_ordered_extent+0x109/0x1c0 [btrfs]
      [12084.036651]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [12084.036659]  btrfs_run_ordered_extent_work+0x1a/0x30 [btrfs]
      [12084.036688]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [12084.036719]  ? lock_is_held_type+0xe8/0x140
      [12084.036727]  process_one_work+0x252/0x5a0
      [12084.036736]  ? process_one_work+0x5a0/0x5a0
      [12084.036738]  worker_thread+0x52/0x3b0
      [12084.036743]  ? process_one_work+0x5a0/0x5a0
      [12084.036745]  kthread+0xf2/0x120
      [12084.036747]  ? kthread_complete_and_exit+0x20/0x20
      [12084.036751]  ret_from_fork+0x22/0x30
      [12084.036765]  </TASK>
      [12084.036769] INFO: task kworker/u16:11:153787 blocked for more than 241 seconds.
      [12084.037702]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.038540] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.039506] task:kworker/u16:11  state:D stack:    0 pid:153787 ppid:     2 flags:0x00004000
      [12084.039511] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
      [12084.039551] Call Trace:
      [12084.039553]  <TASK>
      [12084.039557]  __schedule+0x3cb/0xed0
      [12084.039566]  schedule+0x4e/0xb0
      [12084.039569]  schedule_timeout+0xed/0x130
      [12084.039573]  ? mark_held_locks+0x50/0x80
      [12084.039578]  ? _raw_spin_unlock_irq+0x24/0x50
      [12084.039580]  ? lockdep_hardirqs_on+0x7d/0x100
      [12084.039585]  __wait_for_common+0xaf/0x1f0
      [12084.039587]  ? usleep_range_state+0xb0/0xb0
      [12084.039596]  btrfs_wait_ordered_extents+0x3d6/0x470 [btrfs]
      [12084.039636]  btrfs_wait_ordered_roots+0x175/0x240 [btrfs]
      [12084.039670]  flush_space+0x25b/0x630 [btrfs]
      [12084.039712]  btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
      [12084.039747]  process_one_work+0x252/0x5a0
      [12084.039756]  ? process_one_work+0x5a0/0x5a0
      [12084.039758]  worker_thread+0x52/0x3b0
      [12084.039762]  ? process_one_work+0x5a0/0x5a0
      [12084.039765]  kthread+0xf2/0x120
      [12084.039766]  ? kthread_complete_and_exit+0x20/0x20
      [12084.039770]  ret_from_fork+0x22/0x30
      [12084.039783]  </TASK>
      [12084.039800] INFO: task kworker/u16:17:217907 blocked for more than 241 seconds.
      [12084.040709]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.041398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.042404] task:kworker/u16:17  state:D stack:    0 pid:217907 ppid:     2 flags:0x00004000
      [12084.042411] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
      [12084.042461] Call Trace:
      [12084.042463]  <TASK>
      [12084.042471]  __schedule+0x3cb/0xed0
      [12084.042485]  schedule+0x4e/0xb0
      [12084.042490]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
      [12084.042539]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [12084.042551]  lock_extent_bits+0x37/0x90 [btrfs]
      [12084.042601]  btrfs_finish_ordered_io.isra.0+0x3fd/0x960 [btrfs]
      [12084.042656]  ? lock_is_held_type+0xe8/0x140
      [12084.042667]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [12084.042716]  ? lock_is_held_type+0xe8/0x140
      [12084.042727]  process_one_work+0x252/0x5a0
      [12084.042742]  worker_thread+0x52/0x3b0
      [12084.042750]  ? process_one_work+0x5a0/0x5a0
      [12084.042754]  kthread+0xf2/0x120
      [12084.042757]  ? kthread_complete_and_exit+0x20/0x20
      [12084.042763]  ret_from_fork+0x22/0x30
      [12084.042783]  </TASK>
      [12084.042798] INFO: task fio:234517 blocked for more than 241 seconds.
      [12084.043598]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.044282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.045244] task:fio             state:D stack:    0 pid:234517 ppid:234515 flags:0x00004000
      [12084.045248] Call Trace:
      [12084.045250]  <TASK>
      [12084.045254]  __schedule+0x3cb/0xed0
      [12084.045263]  schedule+0x4e/0xb0
      [12084.045266]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
      [12084.045298]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [12084.045306]  lock_extent_bits+0x37/0x90 [btrfs]
      [12084.045336]  btrfs_dio_iomap_begin+0x336/0xc60 [btrfs]
      [12084.045370]  ? lock_is_held_type+0xe8/0x140
      [12084.045378]  iomap_iter+0x184/0x4c0
      [12084.045383]  __iomap_dio_rw+0x2c6/0x8a0
      [12084.045406]  iomap_dio_rw+0xa/0x30
      [12084.045408]  btrfs_do_write_iter+0x370/0x5e0 [btrfs]
      [12084.045440]  aio_write+0xfa/0x2c0
      [12084.045448]  ? __might_fault+0x2a/0x70
      [12084.045451]  ? kvm_sched_clock_read+0x14/0x40
      [12084.045455]  ? lock_release+0x153/0x4a0
      [12084.045463]  io_submit_one+0x615/0x9f0
      [12084.045467]  ? __might_fault+0x2a/0x70
      [12084.045469]  ? kvm_sched_clock_read+0x14/0x40
      [12084.045478]  __x64_sys_io_submit+0x83/0x160
      [12084.045483]  ? syscall_enter_from_user_mode+0x1d/0x50
      [12084.045489]  do_syscall_64+0x3b/0x90
      [12084.045517]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [12084.045521] RIP: 0033:0x7fa76511af79
      [12084.045525] RSP: 002b:00007ffd6d6b9058 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
      [12084.045530] RAX: ffffffffffffffda RBX: 00007fa75ba6e760 RCX: 00007fa76511af79
      [12084.045532] RDX: 0000557b304ff3f0 RSI: 0000000000000001 RDI: 00007fa75ba4c000
      [12084.045535] RBP: 00007fa75ba4c000 R08: 00007fa751b76000 R09: 0000000000000330
      [12084.045537] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
      [12084.045540] R13: 0000000000000000 R14: 0000557b304ff3f0 R15: 0000557b30521eb0
      [12084.045561]  </TASK>
      
      Fix this issue by always reserving data space before locking a file range
      at btrfs_dio_iomap_begin(). If we can't reserve the space, we don't error
      out immediately. Instead, after locking the file range, we check whether
      we can do a NOCOW write. If we can, we don't error out, since we don't
      need to allocate a data extent; if we can't, we error out with -ENOSPC.
      This also implies that we may end up reserving space when it's not needed,
      because the write ends up being done in NOCOW mode. In that case we just
      release the space once we notice that we did a NOCOW write. This is the
      same type of logic that is used in the path for buffered IO writes.
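
      The ordering of the fix can be sketched as a small state machine in
      userspace C. The demo_* structure and function are hypothetical and only
      model the decision flow: reserve first, lock second, then decide based on
      whether a NOCOW write is possible.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one direct IO write attempt. */
struct demo_write_ctx {
	bool space_available;  /* input: can data space be reserved right now? */
	bool can_nocow;        /* input: can the range be written in NOCOW mode? */
	bool range_locked;     /* output: is the file range still locked? */
	bool space_reserved;   /* output: is data space still reserved? */
};

static int demo_dio_write_begin(struct demo_write_ctx *ctx)
{
	/* 1. Try to reserve before locking; remember a failure instead of returning. */
	ctx->space_reserved = ctx->space_available;

	/* 2. Only now lock the file range. */
	ctx->range_locked = true;

	/* 3. A NOCOW write needs no new data extent, so drop any reservation. */
	if (ctx->can_nocow) {
		ctx->space_reserved = false;
		return 0;
	}

	/* 4. A COW write without a reservation fails; unlock before returning. */
	if (!ctx->space_reserved) {
		ctx->range_locked = false;
		return -ENOSPC;
	}
	return 0;
}
```

      The key property is that nothing in this flow waits for space while the
      range lock is held, which is what made the circular wait possible.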
      
      Fixes: f0bfa76a ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range")
      CC: stable@vger.kernel.org # 5.17+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f5585f4f
    • Goldwyn Rodrigues's avatar
      btrfs: derive compression type from extent map during reads · 1d8fa2e2
      Goldwyn Rodrigues authored
      Derive the compression type from extent map as opposed to the bio flags
      passed. This makes it more precise and not reliant on function
      parameters.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1d8fa2e2
    • Qu Wenruo's avatar
      btrfs: scrub: move scrub_remap_extent() call into scrub_extent() · a13467ee
      Qu Wenruo authored
      [SUSPICIOUS CODE]
      When refactoring scrub code, I noticed a very strange behavior around
      scrub_remap_extent():
      
      	if (sctx->is_dev_replace)
      		scrub_remap_extent(fs_info, cur_logical, scrub_len,
      				   &cur_physical, &target_dev, &cur_mirror);
      
      As the replace target is a 1:1 copy of the source device, the physical
      offset inside the target should be the same as the physical offset inside
      the source, so this remap call makes no sense to me.
      
      [REAL FUNCTIONALITY]
      After more investigation, it turns out that neither the function name
      scrub_remap_extent() nor its if () condition reflects what really happens.
      
      The real story behind this function is that scrub_pages() never expects
      a missing device, even when replacing a missing device.
      
      What scrub_remap_extent() really does is find a live mirror, so that the
      later scrub_pages() reads data from the good copy rather than from the
      missing device, which would increase error counters unnecessarily.
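
      The idea of picking a live mirror can be sketched as follows. This is a
      simplified userspace illustration with hypothetical demo_* names; the
      real function works on btrfs mapping structures and may choose the
      mirror differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one copy of an extent. */
struct demo_mirror {
	bool missing;       /* is the device holding this copy missing? */
	uint64_t physical;  /* physical offset of the copy on its device */
};

/*
 * Return the index of a mirror whose device is present, preferring the
 * requested one. Return -1 if every copy sits on a missing device.
 */
static int demo_find_live_copy(const struct demo_mirror *mirrors, int nr,
			       int preferred)
{
	if (preferred >= 0 && preferred < nr && !mirrors[preferred].missing)
		return preferred;

	for (int i = 0; i < nr; i++) {
		if (!mirrors[i].missing)
			return i;
	}
	return -1;
}
```

      Reading from the returned mirror avoids issuing reads to a device that
      is known to be absent.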
      
      [IMPROVEMENT]
      There is no need to call scrub_remap_extent() in scrub_simple_mirror()
      at all; we only need to call it before we call scrub_pages().
      
      Also rename the function to scrub_find_live_copy() and add extra comments
      to it.
      
      This way we can remove one parameter from scrub_extent() and avoid the
      unnecessary calls to scrub_remap_extent() for regular replace.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a13467ee