1. 07 Jul, 2021 7 commits
    • btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree · ea32af47
      Filipe Manana authored
      When syncing the log, if we fail to allocate the root node for the log
      root tree:
      
      1) We are unlocking fs_info->tree_log_mutex, but at this point we have
         not yet locked this mutex;
      
      2) We have locked fs_info->tree_root->log_mutex, but we end up not
         unlocking it;
      
      So fix this by unlocking fs_info->tree_root->log_mutex instead of
      fs_info->tree_log_mutex.
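
      As a rough illustration of the rule this fix restores (an error path
      must release the lock that is actually held), the sketch below models
      the two mutexes with plain flags. It is not the kernel code: toy_mutex,
      toy_fs and sync_log_model() are hypothetical names, and everything after
      a successful allocation is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy mutex: tracks only whether it is held (hypothetical, single-threaded). */
struct toy_mutex { bool held; };

static void toy_lock(struct toy_mutex *m)   { assert(!m->held); m->held = true; }
static void toy_unlock(struct toy_mutex *m) { assert(m->held);  m->held = false; }

struct toy_fs {
    struct toy_mutex root_log_mutex;  /* stands in for tree_root->log_mutex */
    struct toy_mutex tree_log_mutex;  /* stands in for fs_info->tree_log_mutex */
};

/*
 * Model of the failing step: the root log mutex is held when the allocation
 * is attempted, and tree_log_mutex has not been locked yet. On failure the
 * only lock to drop is therefore root_log_mutex.
 */
static int sync_log_model(struct toy_fs *fs, bool alloc_fails)
{
    toy_lock(&fs->root_log_mutex);
    if (alloc_fails) {
        toy_unlock(&fs->root_log_mutex);  /* the corrected unlock */
        return -1;
    }
    /* Later steps, including taking tree_log_mutex, are omitted here. */
    toy_unlock(&fs->root_log_mutex);
    return 0;
}
```

      With the buggy variant, toy_unlock() on tree_log_mutex would trip the
      "must be held" assertion and root_log_mutex would stay locked.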
      
      Fixes: e75f9fd1 ("btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex")
      CC: stable@vger.kernel.org # 5.13+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't block if we can't acquire the reclaim lock · 9cc0b837
      Johannes Thumshirn authored
      If we can't acquire the reclaim_bgs_lock on block group reclaim, we
      block until it is free. This can potentially stall for a long time.
      
      While reclaim of block groups is necessary for a good user experience on
      a zoned file system, there still is no need to block as it is best
      effort only, just like when we're deleting unused block groups.
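
      The "try, and skip if busy" pattern described above can be sketched in
      userspace with a C11 atomic flag. This is only a model: reclaim_lock,
      reclaim_trylock() and reclaim_block_groups_model() are hypothetical
      names. The kernel offers mutex_trylock() for this kind of use, but the
      message above does not say which primitive the patch relies on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reclaim lock; not the kernel's mutex type. */
static atomic_flag reclaim_lock = ATOMIC_FLAG_INIT;

static bool reclaim_trylock(void)
{
    /* test_and_set returns the previous value: false means we got the lock. */
    return !atomic_flag_test_and_set(&reclaim_lock);
}

static void reclaim_unlock(void)
{
    atomic_flag_clear(&reclaim_lock);
}

/* Returns true if a reclaim pass ran, false if it was skipped. */
static bool reclaim_block_groups_model(int *passes)
{
    if (!reclaim_trylock())
        return false;   /* best effort: skip instead of waiting */
    (*passes)++;        /* placeholder for the actual reclaim work */
    reclaim_unlock();
    return true;
}
```

      A skipped pass is not an error in this model; the caller simply tries
      again on the next occasion.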
      
      CC: stable@vger.kernel.org # 5.13
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: properly split extent_map for REQ_OP_ZONE_APPEND · abb99cfd
      Naohiro Aota authored
      Damien reported a test failure with btrfs/209. The test itself ran fine,
      but the fsck run afterwards reported a corrupted filesystem.
      
      The filesystem corruption happens because we're splitting an extent and
      then writing the extent twice. We have to split the extent though, because
      we're creating too large extents for a REQ_OP_ZONE_APPEND operation.
      
      When dumping the extent tree, we can see two EXTENT_ITEMs at the same
      start address but different lengths.
      
      $ btrfs inspect dump-tree /dev/nullb1 -t extent
      ...
         item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
                 refs 1 gen 7 flags DATA
                 extent data backref root FS_TREE objectid 257 offset 786432 count 1
         item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
                 refs 1 gen 7 flags DATA
                 extent data backref root FS_TREE objectid 257 offset 786432 count 1
      
      The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in
      extract_ordered_extent(). Since extract_ordered_extent() uses
      create_io_em() to split an existing extent_map, we will have
      split->orig_start != split->start. Then, it will be logged with non-zero
      "extent data offset". Finally, the logged entries are replayed into
      a duplicated EXTENT_ITEM.
      
      Introduce and use a proper splitting function for extent_map. The function
      is intended to be simple and specific to extract_ordered_extent(), e.g.
      it does not support the compression case (we do not allow splitting a
      compressed extent_map anyway).
      
      There was a question raised by Qu, in summary why we want to split the
      extent map (and not the bio):
      
      The problem is not the limit on the zone end, which as you mention is
      the same as the block group end. The problem is that data writes use zone
      append (ZA) operations. ZA BIOs cannot be split, so a large extent may
      need to be processed with multiple ZA BIOs. While that is also true for
      regular writes, the major difference is that ZA operations are "nameless"
      writes that give back the written sectors on completion. And ZA
      operations may be reordered by the block layer (not intentionally
      though). Combine both of these characteristics and you can see that the
      data for a large extent may end up being shuffled when written resulting
      in data corruption and the impossibility to map the extent to some start
      sector.
      
      To avoid this problem, zoned btrfs uses the principle "one data extent
      == one ZA BIO". So large extents need to be split. This is unfortunate,
      but we can revisit this later and optimize, e.g. merge back together the
      fragments of an extent once written if they actually were written
      sequentially in the zone.
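
      The "one data extent == one ZA BIO" principle implies cutting a large
      extent into pieces no bigger than what a single zone append can carry.
      A minimal sketch of that arithmetic follows. struct piece,
      split_for_zone_append() and the max_append parameter are hypothetical,
      and the real code splits ordered extents and extent maps rather than
      plain byte ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct piece { uint64_t offset; uint64_t len; };

/*
 * Split the range [0, len) into pieces of at most max_append bytes, one per
 * zone-append BIO. Returns the number of pieces written to out. Returns 0
 * if len is 0, if max_append is 0, or if out cannot hold all pieces.
 */
static size_t split_for_zone_append(uint64_t len, uint64_t max_append,
                                    struct piece *out, size_t cap)
{
    size_t n = 0;
    uint64_t off = 0;

    if (max_append == 0)
        return 0;
    while (off < len) {
        uint64_t chunk = len - off;

        if (chunk > max_append)
            chunk = max_append;
        if (n == cap)
            return 0;           /* caller's array is too small */
        out[n].offset = off;
        out[n].len = chunk;
        n++;
        off += chunk;
    }
    return n;
}
```

      Each piece then gets its own start sector on completion, so reordering
      between pieces no longer scrambles the contents of a single extent.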

      Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
      Fixes: d22002fd ("btrfs: zoned: split ordered extent when bio is sent")
      CC: stable@vger.kernel.org # 5.12+
      CC: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rework chunk allocation to avoid exhaustion of the system chunk array · 79bd3712
      Filipe Manana authored
      Commit eafa4fd0 ("btrfs: fix exhaustion of the system chunk array
      due to concurrent allocations") fixed a problem that resulted in
      exhausting the system chunk array in the superblock when there are many
      tasks allocating chunks in parallel. Basically too many tasks enter the
      first phase of chunk allocation without previous tasks having finished
      their second phase of allocation, resulting in too many system chunks
      being allocated. That was originally observed when running the fallocate
      tests of stress-ng on a PowerPC machine, using a node size of 64K.
      
      However that commit also introduced a deadlock where a task in phase 1 of
      the chunk allocation waited for another task that had allocated a system
      chunk to finish its phase 2, but that other task was waiting on an extent
      buffer lock held by the first task, therefore resulting in both tasks not
      making any progress. That change was later reverted by a patch with the
      subject "btrfs: fix deadlock with concurrent chunk allocations involving
      system chunks", since there is no simple and short solution to address it
      and the deadlock is relatively easy to trigger on zoned filesystems, while
      the system chunk array exhaustion is not so common.
      
      This change reworks the chunk allocation to avoid the system chunk array
      exhaustion. It accomplishes that by making the first phase of chunk
      allocation do the updates of the device items in the chunk btree and the
      insertion of the new chunk item in the chunk btree. This is done while
      under the protection of the chunk mutex (fs_info->chunk_mutex), in the
      same critical section that checks for available system space, allocates
      a new system chunk if needed and reserves system chunk space. This way
      we do not have chunk space reserved until the second phase completes.
      
      The same logic is applied to chunk removal as well, since it keeps
      reserved system space long after it is done updating the chunk btree.
      
      For direct allocation of system chunks, the previous behaviour remains,
      because otherwise we would deadlock on extent buffers of the chunk btree.
      Changes to the chunk btree are by and large done by chunk allocation and chunk
      removal, which first reserve chunk system space and then later do changes
      to the chunk btree. The other remaining cases are uncommon and correspond
      to adding a device, removing a device and resizing a device. All these
      other cases do not pre-reserve system space, they modify the chunk btree
      right away, so they don't hold reserved space for a long period like chunk
      allocation and chunk removal do.
      
      The diff of this change is huge, but more than half of it is just addition
      of comments describing both how things work regarding chunk allocation and
      removal, including both the new behavior and the parts of the old behavior
      that did not change.
      
      CC: stable@vger.kernel.org # 5.12+
      Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix deadlock with concurrent chunk allocations involving system chunks · 1cb3db1c
      Filipe Manana authored
      When a task attempting to allocate a new chunk verifies that there is not
      currently enough free space in the system space_info and there is another
      task that allocated a new system chunk but has not yet finished the
      creation of the respective block group, it waits for that other task to
      finish creating the block group. This is to avoid exhaustion of the system
      chunk array in the superblock, which is limited, when we have a thundering
      herd of tasks allocating new chunks. This problem was described and fixed
      by commit eafa4fd0 ("btrfs: fix exhaustion of the system chunk array
      due to concurrent allocations").
      
      However there are two very similar scenarios where this can lead to a
      deadlock:
      
      1) Task B allocated a new system chunk and task A is waiting on task B
         to finish creation of the respective system block group. However before
         task B ends its transaction handle and finishes the creation of the
         system block group, it attempts to allocate another chunk (like a data
         chunk for an fallocate operation for a very large range). Task B will
         be unable to progress and allocate the new chunk, because task A set
         space_info->chunk_alloc to 1 and therefore it loops at
         btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
         and set space_info->chunk_alloc to 0, but task A is waiting on task B
         to finish creation of the new system block group, therefore resulting
         in a deadlock;
      
      2) Task B allocated a new system chunk and task A is waiting on task B to
         finish creation of the respective system block group. By the time that
         task B enters the final phase of block group allocation, which happens
         at btrfs_create_pending_block_groups(), when it modifies the extent
         tree, the device tree or the chunk tree to insert the items for some
         new block group, it needs to allocate a new chunk, so it ends up at
         btrfs_chunk_alloc() and keeps looping there because task A has set
         space_info->chunk_alloc to 1, but task A is waiting for task B to
         finish creation of the new system block group and release the reserved
         system space, therefore resulting in a deadlock.
      
      In short, the problem is if a task B needs to allocate a new chunk after
      it previously allocated a new system chunk and if another task A is
      currently waiting for task B to complete the allocation of the new system
      chunk.
      
      Unfortunately this deadlock scenario introduced by the previous fix for
      the system chunk array exhaustion problem does not have a simple and short
      fix, and requires a big change to rework the chunk allocation code so that
      chunk btree updates are all made in the first phase of chunk allocation.
      And since this deadlock regression is being frequently hit on zoned
      filesystems and the system chunk array exhaustion problem is triggered
      in more extreme cases (originally observed on PowerPC with a node size
      of 64K when running the fallocate tests from stress-ng), revert the
      changes from that commit. The next patch in the series, with a subject
      of "btrfs: rework chunk allocation to avoid exhaustion of the system
      chunk array" does the necessary changes to fix the system chunk array
      exhaustion problem.

      Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
      Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
      Fixes: eafa4fd0 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
      CC: stable@vger.kernel.org # 5.12+
      Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: print unusable percentage when reclaiming block groups · 5f93e776
      Johannes Thumshirn authored
      When we're automatically reclaiming a zone, because its zone_unusable
      value is above the reclaim threshold, we're only logging what
      percentage of the zone's capacity is used, but not how much of the
      capacity is unusable.
      
      Also print the percentage of the unusable space in the block group
      before we're reclaiming it.
      
      Example:
      
        BTRFS info (device sdg): reclaiming chunk 230686720 with 13% used 86% unusable
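
      For illustration, a message of this shape can be produced from raw byte
      counts as sketched below. percent_of() and format_reclaim_msg() are
      hypothetical helpers, and the rounding-down behaviour is an assumption
      of the sketch, not something stated in the commit.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Integer percentage of part relative to total, rounded down. */
static uint64_t percent_of(uint64_t part, uint64_t total)
{
    return total ? part * 100 / total : 0;
}

/* Formats a message shaped like the example above into buf. */
static int format_reclaim_msg(char *buf, size_t size, uint64_t chunk_start,
                              uint64_t used, uint64_t unusable,
                              uint64_t length)
{
    return snprintf(buf, size,
                    "reclaiming chunk %" PRIu64 " with %" PRIu64
                    "%% used %" PRIu64 "%% unusable",
                    chunk_start, percent_of(used, length),
                    percent_of(unusable, length));
}
```

      Because each percentage is rounded down independently, the two values
      need not add up to 100 even when the block group is completely
      accounted for.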
      
      CC: stable@vger.kernel.org # 5.13
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work · 54afaae3
      David Sterba authored
      The types in calculation of the used percentage in the reclaiming
      messages are both u64, though bg->length is either 1GiB (non-zoned) or
      the zone size in the zoned mode. The upper limit on zone size is 8GiB so
      this could theoretically overflow in the future; right now the values
      fit.
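
      The message does not spell out the failure mode. One plausible reading,
      stated here as an assumption, is that a division helper taking a 32-bit
      divisor would silently truncate a block group length of 4GiB or more.
      The sketch below shows that truncation in plain userspace C;
      as_u32_divisor() and used_percent() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* What a u64 length becomes when passed where a 32-bit divisor is expected. */
static uint32_t as_u32_divisor(uint64_t length)
{
    return (uint32_t)length;
}

/* Full 64-bit by 64-bit calculation of the used percentage. */
static uint64_t used_percent(uint64_t used, uint64_t length)
{
    return length ? used * 100 / length : 0;
}
```

      A 1GiB length survives the narrowing unchanged, while 8GiB (2^33)
      narrows to 0, which would turn the division into a division by zero.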
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.13
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 22 Jun, 2021 18 commits
    • btrfs: remove unused btrfs_fs_info::total_pinned · 629e33a1
      Nikolay Borisov authored
      This got added 14 years ago in 324ae4df ("Btrfs: Add block group
      pinned accounting back") but it was not ever used. Subsequently its
      usage got gradually removed in 8790d502 ("Btrfs: Add support for
      mirroring across drives") and 11833d66 ("Btrfs: improve async block
      group caching"). Let's remove it for good!

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rip out btrfs_space_info::total_bytes_pinned · 138a12d8
      Josef Bacik authored
      We used this in may_commit_transaction() in order to determine if we
      needed to commit the transaction.  However we no longer have that logic
      and thus have no use of this counter anymore, so delete it.

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rip the first_ticket_bytes logic from fail_all_tickets · 3ffad696
      Josef Bacik authored
      This was a trick implemented to handle the case where we had a giant
      reservation in front of a bunch of little reservations in the ticket
      queue.  If the giant reservation was too large for the transaction
      commit to make a difference we'd ENOSPC everybody out instead of
      committing the transaction.  This logic was put in to force us to go
      back and re-try the transaction commit logic to see if we could make
      progress.
      
      Instead now we know we've committed the transaction, so any space that
      would have been recovered is now available, and would be caught by the
      btrfs_try_granting_tickets() in this loop, so we no longer need this
      code and can simply delete it.

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove FLUSH_DELAYED_REFS from data ENOSPC flushing · 04808553
      Josef Bacik authored
      Since we unconditionally commit the transaction now we no longer need to
      run the delayed refs to make sure our total_bytes_pinned value is
      uptodate, we can simply commit the transaction.  Remove this stage from
      the data flushing list.

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rip out may_commit_transaction · c416a30c
      Josef Bacik authored
      may_commit_transaction was introduced before the ticketing
      infrastructure existed.  There was a problem where we'd legitimately be
      out of space, but every reservation would trigger a transaction commit
      and then fail.  Thus if you had 1000 things trying to make a
      reservation, they'd all do the flushing loop and thus commit the
      transaction 1000 times before they'd get their ENOSPC.
      
      This helper was introduced to short circuit this, if there wasn't space
      that could be reclaimed by committing the transaction then simply ENOSPC
      out.  This made true ENOSPC tests much faster as we didn't waste a bunch
      of time.
      
      However many of our bugs over the years have been from cases where we
      didn't account for some space that would be reclaimed by committing a
      transaction.  The delayed refs rsv space, delayed rsv, many pinned bytes
      miscalculations, etc.  And in the meantime the original problem has been
      solved with ticketing.  We no longer will commit the transaction 1000
      times.  Instead we'll get 1000 waiters, we will go through the flushing
      mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
      out.  The ticketing infrastructure gives us a deterministic way to see
      if we're making progress or not, thus we avoid a lot of extra work.
      
      So simplify this step by simply unconditionally committing the
      transaction.  This removes what is arguably our most common source of
      early ENOSPC bugs and will allow us to drastically simplify many of the
      things we track because we simply won't need them with this stuff gone.

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: fix crash when memory allocations trigger reclaim · 35b22c19
      Filipe Manana authored
      When doing a send we don't expect the task to ever start a transaction
      after the initial check that verifies if commit roots match the regular
      roots. This is because after that we set current->journal_info with a
      stub (special value) that signals we are in send context, so that we take
      a read lock on an extent buffer when reading it from disk and verifying
      it is valid (its generation matches the generation stored in the parent).
      This stub was introduced in 2014 by commit a26e8c9f ("Btrfs: don't
      clear uptodate if the eb is under IO") in order to fix a concurrency issue
      between send and balance.
      
      However there is one particular exception where we end up needing to start
      a transaction and when this happens it results in a crash with a stack
      trace like the following:
      
      [60015.902283] kernel: WARNING: CPU: 3 PID: 58159 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x21/0x80
      [60015.902292] kernel: Modules linked in: uinput rfcomm snd_seq_dummy (...)
      [60015.902384] kernel: CPU: 3 PID: 58159 Comm: btrfs Not tainted 5.12.9-300.fc34.x86_64 #1
      [60015.902387] kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XN-WIFI, BIOS F6 12/24/2015
      [60015.902389] kernel: RIP: 0010:kfence_protect_page+0x21/0x80
      [60015.902393] kernel: Code: ff 0f 1f 84 00 00 00 00 00 55 48 89 fd (...)
      [60015.902396] kernel: RSP: 0018:ffff9fb583453220 EFLAGS: 00010246
      [60015.902399] kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fb583453224
      [60015.902401] kernel: RDX: ffff9fb583453224 RSI: 0000000000000000 RDI: 0000000000000000
      [60015.902402] kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      [60015.902404] kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
      [60015.902406] kernel: R13: ffff9fb583453348 R14: 0000000000000000 R15: 0000000000000001
      [60015.902408] kernel: FS:  00007f158e62d8c0(0000) GS:ffff93bd37580000(0000) knlGS:0000000000000000
      [60015.902410] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [60015.902412] kernel: CR2: 0000000000000039 CR3: 00000001256d2000 CR4: 00000000000506e0
      [60015.902414] kernel: Call Trace:
      [60015.902419] kernel:  kfence_unprotect+0x13/0x30
      [60015.902423] kernel:  page_fault_oops+0x89/0x270
      [60015.902427] kernel:  ? search_module_extables+0xf/0x40
      [60015.902431] kernel:  ? search_bpf_extables+0x57/0x70
      [60015.902435] kernel:  kernelmode_fixup_or_oops+0xd6/0xf0
      [60015.902437] kernel:  __bad_area_nosemaphore+0x142/0x180
      [60015.902440] kernel:  exc_page_fault+0x67/0x150
      [60015.902445] kernel:  asm_exc_page_fault+0x1e/0x30
      [60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580
      [60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 (...)
      [60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246
      [60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
      [60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039
      [60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000
      [60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001
      [60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000
      [60015.902468] kernel:  btrfs_commit_inode_delayed_inode+0x5d/0x120
      [60015.902473] kernel:  btrfs_evict_inode+0x2c5/0x3f0
      [60015.902476] kernel:  evict+0xd1/0x180
      [60015.902480] kernel:  inode_lru_isolate+0xe7/0x180
      [60015.902483] kernel:  __list_lru_walk_one+0x77/0x150
      [60015.902487] kernel:  ? iput+0x1a0/0x1a0
      [60015.902489] kernel:  ? iput+0x1a0/0x1a0
      [60015.902491] kernel:  list_lru_walk_one+0x47/0x70
      [60015.902495] kernel:  prune_icache_sb+0x39/0x50
      [60015.902497] kernel:  super_cache_scan+0x161/0x1f0
      [60015.902501] kernel:  do_shrink_slab+0x142/0x240
      [60015.902505] kernel:  shrink_slab+0x164/0x280
      [60015.902509] kernel:  shrink_node+0x2c8/0x6e0
      [60015.902512] kernel:  do_try_to_free_pages+0xcb/0x4b0
      [60015.902514] kernel:  try_to_free_pages+0xda/0x190
      [60015.902516] kernel:  __alloc_pages_slowpath.constprop.0+0x373/0xcc0
      [60015.902521] kernel:  ? __memcg_kmem_charge_page+0xc2/0x1e0
      [60015.902525] kernel:  __alloc_pages_nodemask+0x30a/0x340
      [60015.902528] kernel:  pipe_write+0x30b/0x5c0
      [60015.902531] kernel:  ? set_next_entity+0xad/0x1e0
      [60015.902534] kernel:  ? switch_mm_irqs_off+0x58/0x440
      [60015.902538] kernel:  __kernel_write+0x13a/0x2b0
      [60015.902541] kernel:  kernel_write+0x73/0x150
      [60015.902543] kernel:  send_cmd+0x7b/0xd0
      [60015.902545] kernel:  send_extent_data+0x5a3/0x6b0
      [60015.902549] kernel:  process_extent+0x19b/0xed0
      [60015.902551] kernel:  btrfs_ioctl_send+0x1434/0x17e0
      [60015.902554] kernel:  ? _btrfs_ioctl_send+0xe1/0x100
      [60015.902557] kernel:  _btrfs_ioctl_send+0xbf/0x100
      [60015.902559] kernel:  ? enqueue_entity+0x18c/0x7b0
      [60015.902562] kernel:  btrfs_ioctl+0x185f/0x2f80
      [60015.902564] kernel:  ? psi_task_change+0x84/0xc0
      [60015.902569] kernel:  ? _flat_send_IPI_mask+0x21/0x40
      [60015.902572] kernel:  ? check_preempt_curr+0x2f/0x70
      [60015.902576] kernel:  ? selinux_file_ioctl+0x137/0x1e0
      [60015.902579] kernel:  ? expand_files+0x1cb/0x1d0
      [60015.902582] kernel:  ? __x64_sys_ioctl+0x82/0xb0
      [60015.902585] kernel:  __x64_sys_ioctl+0x82/0xb0
      [60015.902588] kernel:  do_syscall_64+0x33/0x40
      [60015.902591] kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [60015.902595] kernel: RIP: 0033:0x7f158e38f0ab
      [60015.902599] kernel: Code: ff ff ff 85 c0 79 9b (...)
      [60015.902602] kernel: RSP: 002b:00007ffcb2519bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [60015.902605] kernel: RAX: ffffffffffffffda RBX: 00007ffcb251ae00 RCX: 00007f158e38f0ab
      [60015.902607] kernel: RDX: 00007ffcb2519cf0 RSI: 0000000040489426 RDI: 0000000000000004
      [60015.902608] kernel: RBP: 0000000000000004 R08: 00007f158e297640 R09: 00007f158e297640
      [60015.902610] kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
      [60015.902612] kernel: R13: 0000000000000002 R14: 00007ffcb251aee0 R15: 0000558c1a83e2a0
      [60015.902615] kernel: ---[ end trace 7bbc33e23bb887ae ]---
      
      This happens because when writing to the pipe, by calling kernel_write(),
      we end up doing page allocations using GFP_HIGHUSER | __GFP_ACCOUNT as the
      gfp flags, which allow reclaim to happen if there is memory pressure. This
      allocation happens at fs/pipe.c:pipe_write().
      
      If the reclaim is triggered, inode eviction can be triggered and that in
      turn can result in starting a transaction if the inode has a link count
      of 0. The transaction start happens early on during eviction, when we call
      btrfs_commit_inode_delayed_inode() at btrfs_evict_inode(). This happens if
      there is currently an open file descriptor for an inode with a link count
      of 0 and the reclaim task gets a reference on the inode before that
      descriptor is closed, in which case the reclaim task ends up doing the
      final iput that triggers the inode eviction.
      
      When we have assertions enabled (CONFIG_BTRFS_ASSERT=y), this triggers
      the following assertion at transaction.c:start_transaction():
      
          /* Send isn't supposed to start transactions. */
          ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);
      
      And when assertions are not enabled, it triggers a crash since after that
      assertion we cast current->journal_info into a transaction handle pointer
      and then dereference it:
      
         if (current->journal_info) {
             WARN_ON(type & TRANS_EXTWRITERS);
             h = current->journal_info;
             refcount_inc(&h->use_count);
             (...)
      
      Which obviously results in a crash due to an invalid memory access.
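
      To make the failure concrete: a sentinel stored in a pointer-typed field
      is not a real address, so it must be recognised before any cast and
      dereference. The userspace sketch below shows such a guard. It is not
      the fix adopted here (the patch removes the stub altogether), and
      SEND_STUB with the value 1, struct trans_handle and
      join_transaction_model() are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sentinel: a small non-NULL value that is not a valid address. */
#define SEND_STUB ((void *)1)

struct trans_handle { int use_count; };

enum join_result { JOIN_NEW, JOIN_NESTED, JOIN_REFUSED };

/*
 * Decide what to do with the per-task journal_info value:
 *  - the sentinel is rejected before it can be dereferenced,
 *  - a real handle gets its use count bumped (nested join),
 *  - NULL means a fresh handle would have to be created.
 */
static enum join_result join_transaction_model(void *journal_info)
{
    if (journal_info == SEND_STUB)
        return JOIN_REFUSED;

    if (journal_info) {
        struct trans_handle *h = journal_info;

        h->use_count++;
        return JOIN_NESTED;
    }
    return JOIN_NEW;
}
```

      Without the first check, the sentinel would take the "real handle"
      branch and the increment would touch an address near zero, which matches
      the kind of invalid access reported in the trace.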
      
      The same type of issue can happen during other memory allocations we
      do directly in the send code with kmalloc (and friends) as they use
      GFP_KERNEL and therefore may trigger reclaim too, which started to
      happen since 2016 after commit e780b0d1 ("btrfs: send: use
      GFP_KERNEL everywhere").
      
      The issue could be solved by setting up a NOFS context for the entire
      send operation so that reclaim could not be triggered when allocating
      memory or pages through kernel_write(). However that is not very friendly
      and we can in fact get rid of the send stub because:
      
      1) The stub was introduced way back in 2014 by commit a26e8c9f
         ("Btrfs: don't clear uptodate if the eb is under IO") to solve an
         issue exclusive to when send and balance are running in parallel,
         however there were other problems between balance and send and we no
         longer allow balance and send to run concurrently since
         commit 9e967495 ("Btrfs: prevent send failures and crashes due
         to concurrent relocation"). More generically the issues are between
         send and relocation, and that last commit eliminated only the
         possibility of having send and balance run concurrently, but shrinking
         a device also can trigger relocation, and on zoned filesystems we have
         relocation of partially used block groups triggered automatically as
         well. The previous patch that has a subject of:
      
         "btrfs: ensure relocation never runs while we have send operations running"
      
         Addresses all the remaining cases that can trigger relocation.
      
      2) We can actually allow starting and even committing transactions while
         in a send context if needed because send is not holding any locks that
         would block the start or the commit of a transaction.
      
      So get rid of all the logic added by commit a26e8c9f ("Btrfs: don't
      clear uptodate if the eb is under IO"). We can now always call
      clear_extent_buffer_uptodate() at verify_parent_transid() since send is
      the only case that uses commit roots without having a transaction open or
      without holding the commit_root_sem.

      Reported-by: Chris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtRQ57=qXo3kygwpwEBOU_CA_eKvdmjP52sU=eFvuVOEGw@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ensure relocation never runs while we have send operations running · 1cea5cf0
      Filipe Manana authored
      Relocation and send do not play well together because while send is
      running a block group can be relocated, a transaction committed and
      the respective disk extents get re-allocated and written to or discarded
      while send is about to do something with the extents.
      
      This was explained in commit 9e967495 ("Btrfs: prevent send failures
      and crashes due to concurrent relocation"), which prevented balance and
      send from running in parallel but it did not address one remaining case
      where chunk relocation can happen: shrinking a device (and device deletion
      which shrinks a device's size to 0 before deleting the device).
      
      We also have now one more case where relocation is triggered: on zoned
      filesystems partially used block groups get relocated by a background
      thread, introduced in commit 18bb8bbf ("btrfs: zoned: automatically
      reclaim zones").
      
      So make sure that instead of preventing balance from running when there
      are ongoing send operations, we prevent relocation from happening.
      This uses the infrastructure recently added by a patch that has the
      subject: "btrfs: add cancellable chunk relocation support".
      
      It also adds a spinlock dedicated to the mutual exclusion between
      send and relocation. Previously fs_info->balance_mutex was used, which
      made an attempt to run send block while waiting for balance to finish,
      and that can take a lot of time on large filesystems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1cea5cf0
    • David Sterba's avatar
      btrfs: shorten integrity checker extent data mount option · cbeaae4f
      David Sterba authored
      Subjectively, CHECK_INTEGRITY_INCLUDING_EXTENT_DATA is quite long and
      calling it CHECK_INTEGRITY_DATA still keeps the meaning and matches the
      mount option name.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cbeaae4f
    • David Sterba's avatar
      btrfs: switch mount option bits to enums and use wider type · ccd9395b
      David Sterba authored
      Switch defines of BTRFS_MOUNT_* to an enum (the symbolic names are
      recorded in the debugging information for convenience).
      
      There are two more things done but separating them would not make much
      sense as it's touching the same lines:
      
      - Renumber shifts 18..31 to 17..30 to get rid of the hole in the
        sequence.
      
      - Use 1UL as the value that gets shifted because we're approaching the
        32bit limit and due to integer promotions the value of (1 << 31)
        becomes 0xffffffff80000000 when cast to unsigned long (eg. the option
        manipulating helpers).
      
        This is not causing any problems yet as the operations are in-memory
        and masking the 31st bit works, we don't have more than 31 bits so the
        ill effects of not masking higher bits don't happen. But once we have
        more, the problems will emerge.
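
The sign extension described above can be reproduced in userspace. This is only a sketch: it assumes a 32-bit two's complement int, uses INT_MIN as a well-defined stand-in for the bit pattern that (1 << 31) is meant to produce, and the function names are made up for illustration.

```c
#include <limits.h>

/*
 * INT_MIN has only the sign bit set, which on a 32-bit two's complement int
 * is the bit pattern that (1 << 31) is meant to produce.
 */
static unsigned long widen_int_bit31(void)
{
	int bit31 = INT_MIN;

	/* Converting a negative int to a wider unsigned long sign-extends it. */
	return (unsigned long)bit31;
}

/* Shifting an unsigned long keeps the result a single set bit. */
static unsigned long ulong_bit31(void)
{
	return 1UL << 31;
}
```

On a platform with a 64-bit unsigned long the first helper yields 0xffffffff80000000 while the second yields 0x80000000, which is why the shifted value is switched to 1UL.
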
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ccd9395b
    • David Sterba's avatar
      btrfs: props: change how empty value is interpreted · 5548c8c6
      David Sterba authored
      Based on user feedback and actual problems with compression property,
      there's no support to unset any compression options, or to force no
      compression flag.
      
      Note: This has changed recently in e2fsprogs 1.46.2, 'chattr +m'
      (setting NOCOMPRESS).
      
      In btrfs properties, the empty value should really mean reset to
      defaults, for all properties in general. Right now there's only the
      compression one, so this change should not cause too many problems.
      
      Old behaviour:
      
        $ lsattr file
        ---------------------- file
        # the NOCOMPRESS bit is set
        $ btrfs prop set file compression ''
        $ lsattr file
        ---------------------m file
      
      This is equivalent to 'btrfs prop set file compression no' in current
      btrfs-progs as the 'no' or 'none' values are translated to an empty
      string.
      
      This is where the new behaviour is different: empty string drops the
      compression flag (-c) and nocompress (-m):
      
        $ lsattr file
        ---------------------- file
        # No change
        $ btrfs prop set file compression ''
        $ lsattr file
        ---------------------- file
        $ btrfs prop set file compression lzo
        $ lsattr file
        --------c------------- file
        $ btrfs prop get file compression
        compression=lzo
        $ btrfs prop set file compression ''
        # Reset to the initial state
        $ lsattr file
        ---------------------- file
        # Set NOCOMPRESS bit
        $ btrfs prop set file compression no
        $ lsattr file
        ---------------------m file
      
      This obviously brings problems with backward compatibility, so this
      patch should not be backported without making sure the updated
      btrfs-progs are also used and that scripts have been updated to use the
      new semantics.
      
      Summary:
      
      - old kernel:
        no, none, "" - set NOCOMPRESS bit
      - new kernel:
        no, none - set NOCOMPRESS bit
        "" - drop all compression flags, ie. COMPRESS and NOCOMPRESS
      Signed-off-by: David Sterba <dsterba@suse.com>
      5548c8c6
    • David Sterba's avatar
      btrfs: compression: don't try to compress if we don't have enough pages · f2165627
      David Sterba authored
      The early check if we should attempt compression does not take into
      account the number of input pages. It can happen that there's only one
      page, eg. a tail page after some ranges of the BTRFS_MAX_UNCOMPRESSED
      have been processed, or an isolated page that won't be converted to an
      inline extent.
      
      The single page would be compressed but a later check would drop it
      again because the result size must be at least one block shorter than
      the input. That can never work with just one page.
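
A rough sketch of why a single page can never pass the later size check, assuming 4K pages and 4K blocks. The macro and function names here are hypothetical and do not come from the btrfs sources.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed sizes for illustration: 4K pages and 4K blocks. */
#define SKETCH_PAGE_SIZE  4096UL
#define SKETCH_BLOCK_SIZE 4096UL

/*
 * Largest compressed size that would still be accepted: the result must be
 * at least one block shorter than the input.
 */
static size_t max_accepted_compressed_size(size_t nr_pages)
{
	size_t input = nr_pages * SKETCH_PAGE_SIZE;

	return input >= SKETCH_BLOCK_SIZE ? input - SKETCH_BLOCK_SIZE : 0;
}

/* With one page the accepted size is 0 bytes, so trying is pointless. */
static bool worth_trying_compression(size_t nr_pages)
{
	return max_accepted_compressed_size(nr_pages) > 0;
}
```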
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: David Sterba <dsterba@suse.com>
      f2165627
    • Naohiro Aota's avatar
      btrfs: fix unbalanced unlock in qgroup_account_snapshot() · 44365827
      Naohiro Aota authored
      qgroup_account_snapshot() tries to unlock tree_log_mutex, which has
      not been taken, in an error path. Since ret != 0 in this case, we can
      just return from here.
      
      Fixes: 2a4d84c1 ("btrfs: move delayed ref flushing for qgroup into qgroup helper")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      44365827
    • David Sterba's avatar
      btrfs: sysfs: export dev stats in devinfo directory · da658b57
      David Sterba authored
      The device stats can be read by ioctl, wrapped by command 'btrfs device
      stats'. Provide another source where to read the information in
      /sys/fs/btrfs/FSID/devinfo/DEVID/error_stats . The format is a list of
      'key value' pairs one per line, which is common in other stat files.
      The names are the same as used in other device stat outputs.
      
      The stats are all in one file as it's the snapshot of all available
      stats. The 'one value per file' format is not very suitable here. The
      stats should be valid right after the stats item is read from disk,
      shortly after initializing the device.
      
      In case the stats are not yet valid, print just 'invalid' as the file
      contents.
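
A reader of such a file could be sketched as follows. The helper name is hypothetical, it works on text that has already been read into memory, and the key names used in any sample input are placeholders rather than a statement of what the file contains.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in text laid out as "key value" pairs, one per line.
 * Returns 0 and stores the value on success, -1 if the text is "invalid"
 * or the key is not present.
 */
static int lookup_stat(const char *text, const char *key,
		       unsigned long long *value)
{
	char name[64];
	unsigned long long v;
	int consumed;

	if (strncmp(text, "invalid", 7) == 0)
		return -1;

	/* %n records how many characters the pair occupied. */
	while (sscanf(text, " %63s %llu%n", name, &v, &consumed) == 2) {
		if (strcmp(name, key) == 0) {
			*value = v;
			return 0;
		}
		text += consumed;
	}
	return -1;
}
```
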
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      da658b57
    • David Sterba's avatar
      btrfs: fix typos in comments · 1a9fd417
      David Sterba authored
      Fix typos that have snuck in since the last round. Found by codespell.
      Signed-off-by: David Sterba <dsterba@suse.com>
      1a9fd417
    • Qu Wenruo's avatar
      btrfs: remove a stale comment for btrfs_decompress_bio() · c86bdc9b
      Qu Wenruo authored
      Since commit 8140dc30 ("btrfs: btrfs_decompress_bio() could accept
      compressed_bio instead"), btrfs_decompress_bio() accepts
      "struct compressed_bio" rather than an open-coded parameter list.
      
      Thus the comments for the parameter list are no longer needed.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c86bdc9b
    • Baokun Li's avatar
      btrfs: send: use list_move_tail instead of list_del/list_add_tail · bb930007
      Baokun Li authored
      Use list_move_tail() instead of list_del() + list_add_tail() as it's
      doing the same thing and allows further cleanups.  Open code
      name_cache_used() as there is only one user.
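
The equivalence can be illustrated with a simplified userspace re-implementation of a circular doubly linked list. This is a sketch, not the kernel's list.h; init_list_head() in particular is a stand-in name.

```c
/* Minimal circular doubly linked list, modelled loosely on the kernel's. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Unlink an entry from whatever list it is on. */
static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Insert an entry just before the head, i.e. at the tail. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* One call that does the same as list_del() followed by list_add_tail(). */
static void list_move_tail(struct list_head *entry, struct list_head *head)
{
	list_del(entry);
	list_add_tail(entry, head);
}
```
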
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bb930007
    • Christophe Leroy's avatar
      btrfs: disable build on platforms having page size 256K · b05fbcc3
      Christophe Leroy authored
      With a config having PAGE_SIZE set to 256K, BTRFS build fails
      with the following message
      
        include/linux/compiler_types.h:326:38: error: call to
        '__compiletime_assert_791' declared with attribute error:
        BUILD_BUG_ON failed: (BTRFS_MAX_COMPRESSED % PAGE_SIZE) != 0
      
      BTRFS_MAX_COMPRESSED being 128K, BTRFS cannot support platforms with
      256K pages at the time being.
      
      There are two platforms that can select 256K pages:
       - hexagon
       - powerpc
      
      Disable BTRFS when 256K page size is selected. Supporting this would
      require changes to the subpage mode that's currently being developed.
      Given that 256K is many times larger than page sizes commonly used and
      for what the algorithms and structures have been tuned, it's out of
      scope and disabling build is a reasonable option.
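
The failing build check boils down to simple modular arithmetic. The sketch below mirrors it with C11 _Static_assert; the macro and function names are hypothetical, and only the 128K value comes from the text above.

```c
/* 128K, as stated in the commit message. */
#define SKETCH_MAX_COMPRESSED (128UL * 1024)

/* Returns nonzero when the page size divides the 128K limit evenly. */
static int page_size_fits(unsigned long page_size)
{
	return (SKETCH_MAX_COMPRESSED % page_size) == 0;
}

/* These hold, so 4K and 64K page configurations would build. */
_Static_assert((SKETCH_MAX_COMPRESSED % (4UL * 1024)) == 0, "4K pages fit");
_Static_assert((SKETCH_MAX_COMPRESSED % (64UL * 1024)) == 0, "64K pages fit");

/*
 * 128K % 256K == 128K, which is not 0, so the same assertion written for
 * 256K pages would fail at compile time.
 */
```
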
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      b05fbcc3
    • Filipe Manana's avatar
      btrfs: send: fix invalid path for unlink operations after parent orphanization · d8ac76cd
      Filipe Manana authored
      During an incremental send operation, when processing the new references
      for the current inode, we might send an unlink operation for another inode
      that has a conflicting path and has more than one hard link. However this
      path was computed and cached before we processed previous new references
      for the current inode. We may have orphanized a directory of that path
      while processing a previous new reference, in which case the path will
      be invalid and cause the receiver process to fail.
      
      The following reproducer triggers the problem and explains how/why it
      happens in its comments:
      
        $ cat test-send-unlink.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        # Create our test files and directory. Inode 259 (file3) has two hard
        # links.
        touch $MNT/file1
        touch $MNT/file2
        touch $MNT/file3
      
        mkdir $MNT/A
        ln $MNT/file3 $MNT/A/hard_link
      
        # Filesystem looks like:
        #
        # .                                     (ino 256)
        # |----- file1                          (ino 257)
        # |----- file2                          (ino 258)
        # |----- file3                          (ino 259)
        # |----- A/                             (ino 260)
        #        |---- hard_link                (ino 259)
        #
      
        # Now create the base snapshot, which is going to be the parent snapshot
        # for a later incremental send.
        btrfs subvolume snapshot -r $MNT $MNT/snap1
        btrfs send -f /tmp/snap1.send $MNT/snap1
      
        # Move inode 257 into directory inode 260. This results in computing the
        # path for inode 260 as "/A" and caching it.
        mv $MNT/file1 $MNT/A/file1
      
        # Move inode 258 (file2) into directory inode 260, with a name of
        # "hard_link", moving first inode 259 away since it currently has that
        # location and name.
        mv $MNT/A/hard_link $MNT/tmp
        mv $MNT/file2 $MNT/A/hard_link
      
        # Now rename inode 260 to something else (B for example) and then create
        # a hard link for inode 258 that has the old name and location of inode
        # 260 ("/A").
        mv $MNT/A $MNT/B
        ln $MNT/B/hard_link $MNT/A
      
        # Filesystem now looks like:
        #
        # .                                     (ino 256)
        # |----- tmp                            (ino 259)
        # |----- file3                          (ino 259)
        # |----- B/                             (ino 260)
        # |      |---- file1                    (ino 257)
        # |      |---- hard_link                (ino 258)
        # |
        # |----- A                              (ino 258)
      
        # Create another snapshot of our subvolume and use it for an incremental
        # send.
        btrfs subvolume snapshot -r $MNT $MNT/snap2
        btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
      
        # Now unmount the filesystem, create a new one, mount it and try to
        # apply both send streams to recreate both snapshots.
        umount $DEV
      
        mkfs.btrfs -f $DEV >/dev/null
      
        mount $DEV $MNT
      
        # First add the first snapshot to the new filesystem by applying the
        # first send stream.
        btrfs receive -f /tmp/snap1.send $MNT
      
        # The incremental receive operation below used to fail with the
        # following error:
        #
        #    ERROR: unlink A/hard_link failed: No such file or directory
        #
        # This is because when send is processing inode 257, it generates the
        # path for inode 260 as "/A", since that inode is its parent in the send
        # snapshot, and caches that path.
        #
        # Later when processing inode 258, it first processes its new reference
        # that has the path of "/A", which results in orphanizing inode 260
        # because there is a path collision. This results in issuing a rename
        # operation from "/A" to "/o260-6-0".
        #
        # Finally when processing the new reference "B/hard_link" for inode 258,
        # it notices that it collides with inode 259 (not yet processed, because
        # it has a higher inode number), since that inode has the name
        # "hard_link" under the directory inode 260. It also checks that inode
        # 259 has two hardlinks, so it decides to issue an unlink operation for
        # the name "hard_link" for inode 259. However the path passed to the
        # unlink operation is "/A/hard_link", which is incorrect since currently
        # "/A" does not exist, due to the orphanization of inode 260 mentioned
        # before. The path is incorrect because it was computed and cached
        # before the orphanization. This causes the receiver to fail with
        # the above error.
        btrfs receive -f /tmp/snap2.send $MNT
      
        umount $MNT
      
      When running the test, it fails like this:
      
        $ ./test-send-unlink.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: unlink A/hard_link failed: No such file or directory
      
      Fix this by recomputing a path before issuing an unlink operation when
      processing the new references for the current inode if we previously
      have orphanized a directory.
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d8ac76cd
  3. 21 Jun, 2021 15 commits
    • David Sterba's avatar
      btrfs: inline wait_current_trans_commit_start in its caller · ae5d29d4
      David Sterba authored
      Function wait_current_trans_commit_start is now fairly trivial so it can
      be inlined in its only caller.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ae5d29d4
    • David Sterba's avatar
      btrfs: sink wait_for_unblock parameter to async commit · 32cc4f87
      David Sterba authored
      There's only one caller left btrfs_ioctl_start_sync that passes 0, so we
      can remove the switch in btrfs_commit_transaction_async.
      
      A cleanup 9babda9f ("btrfs: Remove async_transid from
      btrfs_mksubvol/create_subvol/create_snapshot") removed calls that passed
      1, so this is a followup.
      
      As this removes last call of wait_current_trans_commit_start_and_unblock,
      remove the function as well.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      32cc4f87
    • Nathan Chancellor's avatar
      btrfs: remove total_data_size variable in btrfs_batch_insert_items() · bfaa324e
      Nathan Chancellor authored
      clang warns:
      
        fs/btrfs/delayed-inode.c:684:6: warning: variable 'total_data_size' set
        but not used [-Wunused-but-set-variable]
      	  int total_data_size = 0, total_size = 0;
      	      ^
        1 warning generated.
      
      This variable's value has been unused since commit fc0d82e1 ("btrfs:
      sink total_data parameter in setup_items_for_insert"). Eliminate it.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/1391
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bfaa324e
    • Nikolay Borisov's avatar
      btrfs: eliminate insert label in add_falloc_range · 77d25534
      Nikolay Borisov authored
      By way of inverting the list_empty conditional the insert label can be
      eliminated, making the function's flow entirely linear.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      77d25534
    • Qu Wenruo's avatar
      btrfs: subpage: fix a rare race between metadata endio and eb freeing · 3d078efa
      Qu Wenruo authored
      [BUG]
      There is a very rare ASSERT() triggering during full fstests run for
      subpage rw support.
      
      No other reproducer so far.
      
      The ASSERT() gets triggered for metadata read in
      btrfs_page_set_uptodate() inside end_page_read().
      
      [CAUSE]
      There is still a small race window for metadata only, the race could
      happen like this:
      
                      T1                  |              T2
      ------------------------------------+-----------------------------
      end_bio_extent_readpage()           |
      |- btrfs_validate_metadata_buffer() |
      |  |- free_extent_buffer()          |
      |     Still have 2 refs             |
      |- end_page_read()                  |
         |- if (unlikely(PagePrivate())   |
         |  The page still has Private    |
         |                                | free_extent_buffer()
         |                                | |  Only one ref 1, will be
         |                                | |  released
         |                                | |- detach_extent_buffer_page()
         |                                |    |- btrfs_detach_subpage()
         |- btrfs_page_set_uptodate()     |
            The page no longer has Private|
            >>> ASSERT() triggered <<<    |
      
      This race window is very small and thus hard to hit, even with many
      runs of fstests.
      
      But the race window is still there, so we need a solution other than
      relying on a racy PagePrivate() check.
      
      Data path is not affected, as it will lock the page before reading,
      while unlocking the page after the last read has finished, thus no race
      window.
      
      [FIX]
      This patch will fix the bug by repurposing btrfs_subpage::readers.
      
      Now btrfs_subpage::readers will be a member shared by both metadata and
      data.
      
      For metadata path, we don't do the page unlock as metadata only relies
      on extent locking.
      
      At the same time, teach page_range_has_eb() to take
      btrfs_subpage::readers into consideration.
      
      So that even if the last eb of a page gets freed, page::private won't be
      detached as long as there still are pending end_page_read() calls.
      
      This eliminates the race window. It will slightly increase the
      metadata memory usage, as the page may not be released as frequently as
      usual.  But it should not be a big deal.
      
      The code got introduced in ("btrfs: submit read time repair only for
      each corrupted sector"), but the fix is in a separate patch to keep the
      problem description and the crash is rare so it should not hurt
      bisectability.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3d078efa
    • Qu Wenruo's avatar
      btrfs: don't clear page extent mapped if we're not invalidating the full page · bcd77455
      Qu Wenruo authored
      [BUG]
      With current btrfs subpage rw support, the following script can lead to
      fs hang:
      
        $ mkfs.btrfs -f -s 4k $dev
        $ mount $dev -o nospace_cache $mnt
        $ fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt
      
      The fs will hang at btrfs_start_ordered_extent().
      
      [CAUSE]
      In the above test case, btrfs_invalidatepage() will be called with the
      following parameters:
      
        offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000
      
      Since @offset is 0, btrfs_invalidatepage() will try to invalidate the
      full page, and finally call clear_page_extent_mapped() which will detach
      the subpage structure from the page.
      
      And since the page no longer has subpage structure, the subpage dirty
      bitmap will be cleared, preventing the dirty range from being written
      back, thus no way to wake up the ordered extent.
      
      [FIX]
      Follow other filesystems and only invalidate the page if the range
      covers the full page.
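
The condition can be sketched as below, assuming 64K pages (the 16-bit dirty bitmap above suggests 16 sectors of 4K). The macro and function names are hypothetical and this is not the kernel code.

```c
#include <stdbool.h>

/* Assumed page size for illustration: 64K. */
#define SKETCH_PAGE_SIZE (64UL * 1024)

/* Only a range starting at 0 and spanning the whole page counts as full. */
static bool covers_full_page(unsigned long offset, unsigned long length)
{
	return offset == 0 && length == SKETCH_PAGE_SIZE;
}
```

With the parameters from the report (offset = 0, length = 53248) the range stops short of the page end, so under this check the page would not be treated as fully invalidated.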
      
      There are cases like truncate_setsize() which can call
      btrfs_invalidatepage() with offset == 0 and length != 0 for the last
      page of an inode.
      
      Although the old code will still try to invalidate the full page, we are
      still safe to just wait for ordered extent to finish.
      So it shouldn't cause extra problems.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bcd77455
    • Qu Wenruo's avatar
      btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() · 0528476b
      Qu Wenruo authored
      [BUG]
      With current subpage RW support, the following script can hang the fs
      with 64K page size.
      
       # mkfs.btrfs -f -s 4k $dev
       # mount $dev -o nospace_cache $mnt
       # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt
      
      The kernel will do an infinite loop in btrfs_punch_hole_lock_range().
      
      [CAUSE]
      In btrfs_punch_hole_lock_range() we:
      
      - Truncate page cache range
      - Lock extent io tree
      - Wait for any ordered extents in the range.
      
      We exit the loop only when all the following conditions are met:
      
      - No ordered extent in the lock range
      - No page is in the lock range
      
      The latter condition has a pitfall: it only works for the sector size ==
      PAGE_SIZE case.
      
      It can't handle the following subpage case:
      
        0       32K     64K     96K     128K
        |       |///////||//////|       ||
      
      lockstart=32K
      lockend=96K - 1
      
      In this case, although the range crosses 2 pages,
      truncate_pagecache_range() will invalidate no page at all, but only zero
      the [32K, 96K) range of the two pages.
      
      Thus filemap_range_has_page(32K, 96K-1) will always return true, and we
      will never meet the loop exit condition.
      
      [FIX]
      Fix the problem by doing page alignment for the lock range.
      
      Function filemap_range_has_page() already handles the lend < lstart
      case, so we only need to round up @lockstart and round down @lockend for
      truncate_pagecache_range().
      
      This modification should not change anything for the sector size ==
      PAGE_SIZE case, as in that case our range is already page aligned.
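
One way to do the alignment, worked through for the 32K..96K-1 example above. This is a sketch under stated assumptions: the struct and helper names are hypothetical, offsets are signed 64-bit so that an empty range can be expressed as end < start, and the exact expression used by the patch may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive byte range covering only whole pages. */
struct page_range {
	int64_t start;
	int64_t end;
};

static int64_t round_up_to(int64_t x, int64_t align)
{
	return (x + align - 1) / align * align;
}

static int64_t round_down_to(int64_t x, int64_t align)
{
	return x / align * align;
}

/* Shrink [lockstart, lockend] to the pages that lie fully inside it. */
static struct page_range page_aligned_range(int64_t lockstart, int64_t lockend,
					    int64_t page_size)
{
	struct page_range r;

	r.start = round_up_to(lockstart, page_size);
	r.end = round_down_to(lockend + 1, page_size) - 1;
	return r;
}

/* No full page lies inside the range when the end precedes the start. */
static bool range_is_empty(struct page_range r)
{
	return r.end < r.start;
}
```

For lockstart=32K and lockend=96K-1 with 64K pages this gives start=64K and end=64K-1, an empty range, so the page check no longer keeps the loop spinning. With 4K pages the same input is returned unchanged.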
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0528476b
    • Qu Wenruo's avatar
      btrfs: reflink: make copy_inline_to_page() to be subpage compatible · 3115deb3
      Qu Wenruo authored
      The modifications are:
      
      - Page copy destination
        For subpage case, one page can contain multiple sectors, thus we can
        no longer expect the memcpy_to_page()/btrfs_decompress() to copy
        data into page offset 0.
        The correct offset is offset_in_page(file_offset) now, which should
        handle both regular sectorsize and subpage cases well.
      
      - Page status update
        Now we need to use subpage helper to handle the page status update.
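
The destination offset change can be illustrated with a userspace stand-in for offset_in_page() that takes the page size as a parameter. The function name is hypothetical and this is only a sketch.

```c
#include <stdint.h>

/*
 * Offset of a file position inside its page.
 * page_size must be a power of two.
 */
static uint64_t offset_in_page_sketch(uint64_t file_offset, uint64_t page_size)
{
	return file_offset & (page_size - 1);
}
```

With 4K pages a sector-aligned file offset of 4K maps to page offset 0, while with 64K pages the same file offset maps to page offset 4K, which is why copying to offset 0 is no longer correct for subpage.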
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3115deb3
    • Qu Wenruo's avatar
      btrfs: make btrfs_page_mkwrite() to be subpage compatible · 2d8ec40e
      Qu Wenruo authored
      Only set_page_dirty() and SetPageUptodate() are not subpage compatible.
      Convert them to subpage helpers, so that __extent_writepage_io() can
      submit page content correctly.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2d8ec40e
    • Qu Wenruo's avatar
      btrfs: make btrfs_truncate_block() to be subpage compatible · 6c9ac8be
      Qu Wenruo authored
      btrfs_truncate_block() itself is already mostly subpage compatible, the
      only missing part is the page dirtying code.
      
      Currently if we have a sector that needs to be truncated, we set the
      sector aligned range delalloc, then set the full page dirty.
      
      The problem is, current subpage code requires the subpage dirty bit to
      be set, or __extent_writepage_io() won't submit the bio, so the ordered
      extent never finishes.
      
      So this patch makes btrfs_truncate_block() call the
      btrfs_page_set_dirty() helper instead of set_page_dirty() to fix the
      problem.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6c9ac8be
    • Qu Wenruo's avatar
      btrfs: make __extent_writepage_io() only submit dirty range for subpage · c5ef5c6c
      Qu Wenruo authored
      __extent_writepage_io() function originally just iterates through all
      the extent maps of a page, and submits any regular extents.
      
      This is fine for sectorsize == PAGE_SIZE case, as if a page is dirty, we
      need to submit the only sector contained in the page.
      
      But for subpage case, one dirty page can contain several clean sectors
      with at least one dirty sector.
      
      If __extent_writepage_io() still submits all regular extent maps, it can
      submit data which is already written to disk.
      And since such already written data won't have corresponding ordered
      extents, it will trigger a BUG_ON() in btrfs_csum_one_bio().
      
      Change the behavior of __extent_writepage_io() by finding the first
      dirty byte in the page, and only submit the dirty range rather than the
      full extent.
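
Finding a dirty range could look roughly like the sketch below. It assumes a 16-bit per-page dirty bitmap with one bit per 4K sector of a 64K page; the names are hypothetical and the real subpage code may track dirty state differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTORSIZE       4096u
#define SKETCH_SECTORS_PER_PAGE 16u	/* 64K page / 4K sectors */

/*
 * Find the next run of dirty sectors at or after from_bit.
 * On success, store the byte offset and length of the run inside the page.
 */
static bool find_next_dirty_range(uint16_t dirty_bitmap, unsigned int from_bit,
				  unsigned int *start, unsigned int *len)
{
	unsigned int bit = from_bit;
	unsigned int first;

	/* Skip clean sectors. */
	while (bit < SKETCH_SECTORS_PER_PAGE && !(dirty_bitmap & (1u << bit)))
		bit++;
	if (bit >= SKETCH_SECTORS_PER_PAGE)
		return false;

	/* Extend over the contiguous dirty sectors. */
	first = bit;
	while (bit < SKETCH_SECTORS_PER_PAGE && (dirty_bitmap & (1u << bit)))
		bit++;

	*start = first * SKETCH_SECTORSIZE;
	*len = (bit - first) * SKETCH_SECTORSIZE;
	return true;
}
```

For a bitmap of 0x2000 only sector 13 is dirty, so the range to submit would start at byte 53248 of the page and be 4096 bytes long.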
      
      While we're here, also modify the following calls to be subpage
      compatible:
      
      - SetPageError()
      - end_page_writeback()
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c5ef5c6c
    • Qu Wenruo's avatar
      btrfs: make btrfs_set_range_writeback() subpage compatible · d2a91064
      Qu Wenruo authored
      Function btrfs_set_range_writeback() currently just sets the page
      writeback unconditionally.
      
      Change it to call the subpage helper so that we can handle both cases
      well.
      
      Since the subpage helpers need btrfs_fs_info, also change the parameter
      to accept btrfs_inode.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d2a91064
    • Qu Wenruo's avatar
      btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() · 4750af3b
      Qu Wenruo authored
      In cow_file_range(), after we have succeeded creating an inline extent,
      we unlock the page with extent_clear_unlock_delalloc() by passing
      locked_page == NULL.
      
      For the sectorsize == PAGE_SIZE case, this just makes the page lock and
      unlock pairing harder to follow.
      
      But for the incoming subpage case, it can be a big problem.
      
      For the incoming subpage case, page locking has two entry points:
      
      - __process_pages_contig()
        In that case, we know exactly the range we want to lock (which only
        requires sector alignment).
        To handle the subpage requirement, we introduce btrfs_subpage::writers
        to page::private, and will update it in __process_pages_contig().
      
      - Other direct lock_page()/unlock_page() call sites
        Those won't touch btrfs_subpage::writers at all.
      
      This means a page locked by __process_pages_contig() can only be
      unlocked by __process_pages_contig().
      Thankfully we already have the existing infrastructure in the form of
      @locked_page in various call sites.
      
      Unfortunately, extent_clear_unlock_delalloc() in cow_file_range() after
      creating an inline extent is the exception.
      It intentionally calls extent_clear_unlock_delalloc() with locked_page ==
      NULL, to also unlock the current page (and clear its dirty/writeback bits).
      
      To cooperate with the incoming subpage modifications, and to make the
      page lock/unlock pair easier to understand, this patch makes that call
      site pass locked_page to extent_clear_unlock_delalloc() as well, and
      only unlocks the page in __extent_writepage().
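
      The pairing rule can be sketched with a small userspace model. This is
      an illustration under stated assumptions, not the kernel
      implementation: struct model_page and the range_lock(), range_unlock(),
      direct_lock() and direct_unlock() helpers are hypothetical names, and
      the writers field only loosely mirrors btrfs_subpage::writers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a page whose lock is shared by several sectors. */
struct model_page {
	bool locked;
	int writers; /* sectors accounted by the range-based lock path */
};

/* Range-based path: lock the page and account the sectors it covers. */
static void range_lock(struct model_page *p, int nr_sectors)
{
	assert(nr_sectors > 0);
	p->locked = true;
	p->writers += nr_sectors;
}

/*
 * Range-based unlock: drop the accounted sectors and unlock the page only
 * when the last one is gone. Returns false when the page was not accounted
 * by range_lock(), which is the mismatch the rule above is meant to avoid.
 */
static bool range_unlock(struct model_page *p, int nr_sectors)
{
	assert(nr_sectors > 0);
	if (p->writers < nr_sectors)
		return false;
	p->writers -= nr_sectors;
	if (p->writers == 0)
		p->locked = false;
	return true;
}

/* Direct path: plain lock and unlock, never touches the writers count. */
static void direct_lock(struct model_page *p)
{
	p->locked = true;
}

static void direct_unlock(struct model_page *p)
{
	p->locked = false;
}
```

      In this model, a page taken with direct_lock() fails range_unlock()
      because nothing was ever accounted for it, so each lock path has to be
      paired with its own unlock path.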
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4750af3b
    • Qu Wenruo's avatar
      btrfs: update locked page dirty/writeback/error bits in __process_pages_contig · a33a8e9a
      Qu Wenruo authored
      When __process_pages_contig() gets called for
      extent_clear_unlock_delalloc(), if we hit the locked page, only the
      Private2 bit is updated, while the dirty/writeback/error bits are all
      skipped.
      
      There are several call sites that call extent_clear_unlock_delalloc()
      with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK:
      
      - cow_file_range()
      - run_delalloc_nocow()
      - cow_file_range_async()
        All for their error handling branches.
      
      For those call sites, since we skip the locked page for
      dirty/error/writeback bit update, the locked page will still have its
      subpage dirty bit remaining.
      
      Normally it is up to the call site that locked the page to handle it,
      but it won't hurt if we also do the update.
      
      This is especially so since there are already other call sites doing
      the same thing by manually passing NULL as locked_page.
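
      The intended behaviour can be sketched in a userspace model: every page
      in the range gets its state bits updated, and only the unlock step is
      skipped for the locked page. The OP_* flags, struct model_page and both
      functions below are hypothetical stand-ins for this example, not the
      kernel's PAGE_* flags or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation flags, loosely modelled on the PAGE_* ops above. */
enum {
	OP_CLEAR_DIRTY   = 1 << 0,
	OP_SET_WRITEBACK = 1 << 1,
	OP_END_WRITEBACK = 1 << 2,
	OP_UNLOCK        = 1 << 3,
};

struct model_page {
	bool dirty;
	bool writeback;
	bool locked;
};

/* Apply the state updates to every page; skip only the unlock for locked_page. */
static void process_one(struct model_page *page,
			const struct model_page *locked_page, unsigned int ops)
{
	if (ops & OP_CLEAR_DIRTY)
		page->dirty = false;
	if (ops & OP_SET_WRITEBACK)
		page->writeback = true;
	if (ops & OP_END_WRITEBACK)
		page->writeback = false;

	/* The caller that locked this page remains responsible for unlocking it. */
	if (page == locked_page)
		return;

	if (ops & OP_UNLOCK)
		page->locked = false;
}

static void process_pages(struct model_page *pages, size_t nr,
			  const struct model_page *locked_page, unsigned int ops)
{
	assert(pages != NULL);
	for (size_t i = 0; i < nr; i++)
		process_one(&pages[i], locked_page, ops);
}
```

      Passing NULL as locked_page in this model makes every page, including
      the current one, go through the unlock step.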
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a33a8e9a
    • Qu Wenruo's avatar
      btrfs: make page Ordered bit to be subpage compatible · b945a463
      Qu Wenruo authored
      This involves the following modifications:
      
      - Ordered extent creation
        This is done in process_one_page(), where PAGE_SET_ORDERED now calls
        the subpage helper to do the work.
      
      - endio functions
        This is done in btrfs_mark_ordered_io_finished().
      
      - btrfs_invalidatepage()
      
      - btrfs_cleanup_ordered_extents()
        Use the subpage page helper, and add an extra branch to exit if the
        locked page has covered the full range.
      
      Now the usage of page Ordered flag for ordered extent accounting is fully
      subpage compatible.
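
      The extra exit branch mentioned for btrfs_cleanup_ordered_extents() can
      be pictured with the following userspace sketch. It assumes the range
      starts inside the locked page and that the locked page is handled by
      its own caller; MODEL_PAGE_SIZE, struct model_range and
      trim_locked_page() are hypothetical names for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page size for the model (64K, as on some ppc64/aarch64 setups). */
#define MODEL_PAGE_SIZE 65536ull

struct model_range {
	uint64_t offset;
	uint64_t bytes;
};

/*
 * Given a cleanup range [offset, offset + bytes) that begins inside the
 * locked page, compute what is left once the locked page is excluded.
 * Returns false when the locked page covers the full range, in which case
 * there is nothing left to clean up and the caller can exit early.
 */
static bool trim_locked_page(uint64_t locked_page_start, uint64_t offset,
			     uint64_t bytes, struct model_range *out)
{
	uint64_t locked_page_end = locked_page_start + MODEL_PAGE_SIZE;

	assert(bytes > 0);
	assert(offset >= locked_page_start && offset < locked_page_end);

	if (offset + bytes <= locked_page_end)
		return false;

	out->offset = locked_page_end;
	out->bytes = offset + bytes - locked_page_end;
	return true;
}
```

      With a 4K sector size and a 64K page, a range of a few sectors fits
      entirely inside the locked page, which is the situation the early exit
      is for.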
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b945a463