1. 25 Jul, 2022 40 commits
    • Christoph Hellwig's avatar
      btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read · 0b078d9d
      Christoph Hellwig authored
      This flag was used to communicate that the low-level compression code
      already did verify the checksum to the high-level I/O completion code.
      
      But it has been unused for a long time as the upper btrfs_bio for the
      decompressed data had a NULL csum pointer basically since that pointer
      existed and the code already checks for that a little later.
      
      Note that this does not affect the other use of the checked flag, which
      is only used for the COW fixup worker.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0b078d9d
    • Christoph Hellwig's avatar
      btrfs: fix repair of compressed extents · 81bd9328
      Christoph Hellwig authored
      Currently the checksum of compressed extents is verified based on the
      compressed data and the lower btrfs_bio, but the actual repair process
      is driven by end_bio_extent_readpage on the upper btrfs_bio for the
      decompressed data.
      
      This has a bunch of issues, including not being able to properly
      communicate the failed mirror up in case that the I/O submission got
      preempted, a general loss of if an error was an I/O error or a checksum
      verification failure, but most importantly that this design causes
      btrfs_clean_io_failure to eventually write back the uncompressed good
      data onto the disk sectors that are supposed to contain compressed data.
      
      Fix this by moving the repair to the lower btrfs_bio.  To do so, a fair
      amount of code has to be reshuffled:
      
       a) the lower btrfs_bio now needs a valid csum pointer.  The easiest way
          to achieve that is to pass NULL btrfs_lookup_bio_sums and just use
          the btrfs_bio management of csums.  For a compressed_bio that is
          split into multiple btrfs_bios this means additional memory
          allocations, but the code becomes a lot more regular.
       b) checksum verification now runs directly on the lower btrfs_bio instead
          of the compressed_bio.  This actually nicely simplifies the end I/O
          processing.
       c) btrfs_repair_one_sector can't just look up the logical address for
          the file offset any more, as there is no corresponding relative
          offsets that apply to the file offset and the logic address for
          compressed extents.  Instead require that the saved bvec_iter in the
          btrfs_bio is filled out for all read bios and use that, which again
          removes a fair amount of code.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      81bd9328
    • Christoph Hellwig's avatar
      btrfs: remove the start argument to check_data_csum and export · 7959bd44
      Christoph Hellwig authored
      Derive the value of start from the btrfs_bio now that ->file_offset is
      always valid.  Also export and rename the function so it's available
      outside of inode.c as we'll need that soon.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7959bd44
    • Christoph Hellwig's avatar
      btrfs: pass a btrfs_bio to btrfs_repair_one_sector · 7aa51232
      Christoph Hellwig authored
      Pass the btrfs_bio instead of the plain bio to btrfs_repair_one_sector,
      and remove the start and failed_mirror arguments in favor of deriving
      them from the btrfs_bio.  For this to work ensure that the file_offset
      field is also initialized for buffered I/O.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7aa51232
    • Christoph Hellwig's avatar
      btrfs: simplify the pending I/O counting in struct compressed_bio · 524bcd1e
      Christoph Hellwig authored
      Instead of counting the sectors just count the bios, with an extra
      reference held during submission.  This significantly simplifies the
      submission side error handling.
      
      This slightly changes completion and error handling of
      btrfs_submit_compressed_{read,write} because with the old code the
      compressed_bio could have been completed in
      submit_compressed_{read,write} only if there was an error during
      submission for one of the lower bio, whilst with the new code there is a
      chance for this to happen even for successful submission if the all the
      lower bios complete before the end of the function is reached.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      524bcd1e
    • Christoph Hellwig's avatar
      btrfs: repair all known bad mirrors · c144c63f
      Christoph Hellwig authored
      When there is more than a single level of redundancy there can also be
      multiple bad mirrors, and the current read repair code only repairs the
      last bad one.
      
      Restructure btrfs_repair_one_sector so that it records the originally
      failed mirror and the number of copies, and then repair all known bad
      copies until we reach the originally failed copy in clean_io_failure.
      Note that this also means the read repair reads will always start from
      the next bad mirror and not mirror 0.
      
      This fixes btrfs/265 in xfstests.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c144c63f
    • Christoph Hellwig's avatar
      btrfs: merge btrfs_dev_stat_print_on_error with its only caller · d28beb3e
      Christoph Hellwig authored
      Fold it into the only caller.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d28beb3e
    • Filipe Manana's avatar
      btrfs: join running log transaction when logging new name · 723df2bc
      Filipe Manana authored
      When logging a new name, in case of a rename, we pin the log before
      changing it. We then either delete a directory entry from the log or
      insert a key range item to mark the old name for deletion on log replay.
      
      However when doing one of those log changes we may have another task that
      started writing out the log (at btrfs_sync_log()) and it started before
      we pinned the log root. So we may end up changing a log tree while its
      writeback is being started by another task syncing the log. This can lead
      to inconsistencies in a log tree and other unexpected results during log
      replay, because we can get some committed node pointing to a node/leaf
      that ends up not getting written to disk before the next log commit.
      
      The problem, conceptually, started to happen in commit 88d2beec
      ("btrfs: avoid logging all directory changes during renames"), because
      there we started to update the log without joining its current transaction
      first.
      
      However the problem only became visible with commit 259c4b96
      ("btrfs: stop doing unnecessary log updates during a rename"), and that is
      because we used to pin the log at btrfs_rename() and then before entering
      btrfs_log_new_name(), when unlinking the old dentry, we ended up at
      btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(). Both
      of them join the current log transaction, effectively waiting for any log
      transaction writeout (due to acquiring the root's log_mutex). This made it
      safe even after leaving the current log transaction, because we remained
      with the log pinned when we called btrfs_log_new_name().
      
      Then in commit 259c4b96 ("btrfs: stop doing unnecessary log updates
      during a rename"), we removed the log pinning from btrfs_rename() and
      stopped calling btrfs_del_inode_ref_in_log() and
      btrfs_del_dir_entries_in_log() during the rename, and started to do all
      the needed work at btrfs_log_new_name(), but without joining the current
      log transaction, only pinning the log, which is racy because another task
      may have started writeout of the log tree right before we pinned the log.
      
      Both commits landed in kernel 5.18, so it doesn't make any practical
      difference which should be blamed, but I'm blaming the second commit only
      because with the first one, by chance, the problem did not happen due to
      the fact we joined the log transaction after pinning the log and unpinned
      it only after calling btrfs_log_new_name().
      
      So make btrfs_log_new_name() join the current log transaction instead of
      pinning it, so that we never do log updates if it's writeout is starting.
      
      Fixes: 259c4b96 ("btrfs: stop doing unnecessary log updates during a rename")
      CC: stable@vger.kernel.org # 5.18+
      Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Tested-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      723df2bc
    • Nikolay Borisov's avatar
      btrfs: simplify error handling in btrfs_lookup_dentry · fc8b235f
      Nikolay Borisov authored
      In btrfs_lookup_dentry releasing the reference of the sub_root and the
      running orphan cleanup should only happen if the dentry found actually
      represents a subvolume. This can only be true in the 'else' branch as
      otherwise either fixup_tree_root_location returned an ENOENT error, in
      which case sub_root wouldn't have been changed or if we got a different
      errno this means btrfs_get_fs_root couldn't have executed successfully
      again meaning sub_root will equal to root. So simplify all the branches
      by moving the code into the 'else'.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fc8b235f
    • Filipe Manana's avatar
      btrfs: send: always use the rbtree based inode ref management infrastructure · 0d8869fb
      Filipe Manana authored
      After the patch "btrfs: send: fix sending link commands for existing file
      paths", we now have two infrastructures to detect and eliminate duplicated
      inode references (due to names that got removed and re-added between the
      send and parent snapshots):
      
      1) One that works on a single inode ref/extref item;
      
      2) A new one that works acrosss all ref/extref items for an inode, and
         it's also more efficient because even in the single ref/extref item
         case, it does not do a linear search for all the names encoded in the
         ref/extref item, it uses red black trees to speedup up the search.
      
      There's no good reason to keep both infrastructures, we can use the new
      one everywhere, and it's always more efficient.
      
      So remove the old infrastructure and change all sites that are using it
      to use the new one.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d8869fb
    • BingJing Chang's avatar
      btrfs: send: fix sending link commands for existing file paths · 3aa5bd36
      BingJing Chang authored
      There is a bug sending link commands for existing file paths. When we're
      processing an inode, we go over all references. All the new file paths are
      added to the "new_refs" list. And all the deleted file paths are added to
      the "deleted_refs" list. In the end, when we finish processing the inode,
      we iterate over all the items in the "new_refs" list and send link commands
      for those file paths. After that, we go over all the items in the
      "deleted_refs" list and send unlink commands for them. If there are
      duplicated file paths in both lists, we will try to create them before we
      remove them. Then the receiver gets an -EEXIST error when trying the link
      operations.
      
      Example for having duplicated file paths in both list:
      
        $ btrfs subvolume create vol
      
        # create a file and 2000 hard links to the same inode
        $ touch vol/foo
        $ for i in {1..2000}; do link vol/foo vol/$i ; done
      
        # take a snapshot for a parent snapshot
        $ btrfs subvolume snapshot -r vol snap1
      
        # remove 2000 hard links and re-create the last 1000 links
        $ for i in {1..2000}; do rm vol/$i; done;
        $ for i in {1001..2000}; do link vol/foo vol/$i; done
      
        # take another one for a send snapshot
        $ btrfs subvolume snapshot -r vol snap2
      
        $ mkdir receive_dir
        $ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
        At subvol snap2
        link 1238 -> foo
        ERROR: link 1238 -> foo failed: File exists
      
      In this case, we will have the same file paths added to both lists. In the
      parent snapshot, reference paths {1..1237} are stored in inode references,
      but reference paths {1238..2000} are stored in inode extended references.
      In the send snapshot, all reference paths {1001..2000} are stored in inode
      references. During the incremental send, we process their inode references
      first. In record_changed_ref(), we iterate all its inode references in the
      send/parent snapshot. For every inode reference, we also use find_iref() to
      check whether the same file path also appears in the parent/send snapshot
      or not. Inode references {1238..2000} which appear in the send snapshot but
      not in the parent snapshot are added to the "new_refs" list. On the other
      hand, Inode references {1..1000} which appear in the parent snapshot but
      not in the send snapshot are added to the "deleted_refs" list. Next, when
      we process their inode extended references, reference paths {1238..2000}
      are added to the "deleted_refs" list because all of them only appear in the
      parent snapshot. Now two lists contain items as below:
      "new_refs" list: {1238..2000}
      "deleted_refs" list: {1..1000}, {1238..2000}
      
      Reference paths {1238..2000} appear in both lists. And as the processing
      order mentioned about before, the receiver gets an -EEXIST error when trying
      the link operations.
      
      To fix the bug, the idea is to process the "deleted_refs" list before
      the "new_refs" list. However, it's not easy to reshuffle the processing
      order. For one reason, if we do so, we may unlink all the existing paths
      first, there's no valid path anymore for links. And it's inefficient
      because we do a bunch of unlinks followed by links for the same paths.
      Moreover, it makes less sense to have duplications in both lists. A
      reference path cannot not only be regarded as new but also has been seen in
      the past, or we won't call it a new path. However, it's also not a good
      idea to make find_iref() check a reference against all inode references
      and all inode extended references because it may result in large disk
      reads.
      
      So we introduce two rbtrees to make the references easier for lookups.
      And we also introduce record_new_ref_if_needed() and
      record_deleted_ref_if_needed() for changed_ref() to check and remove
      duplicated references early.
      Reviewed-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBingJing Chang <bingjingc@synology.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3aa5bd36
    • BingJing Chang's avatar
      btrfs: send: introduce recorded_ref_alloc and recorded_ref_free · 71ecfc13
      BingJing Chang authored
      Introduce wrappers to allocate and free recorded_ref structures.
      Reviewed-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBingJing Chang <bingjingc@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      71ecfc13
    • Naohiro Aota's avatar
      btrfs: zoned: wait until zone is finished when allocation didn't progress · 2ce543f4
      Naohiro Aota authored
      When the allocated position doesn't progress, we cannot submit IOs to
      finish a block group, but there should be ongoing IOs that will finish a
      block group. So, in that case, we wait for a zone to be finished and retry
      the allocation after that.
      
      Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to
      indicate we need a zone finish to have proceeded. The flag is set when the
      allocator detected it cannot activate a new block group. And, it is cleared
      once a zone is finished.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2ce543f4
    • Naohiro Aota's avatar
      btrfs: zoned: write out partially allocated region · 898793d9
      Naohiro Aota authored
      cow_file_range() works in an all-or-nothing way: if it fails to allocate an
      extent for a part of the given region, it gives up all the region including
      the successfully allocated parts. On cow_file_range(), run_delalloc_zoned()
      writes data for the region only when it successfully allocate all the
      region.
      
      This all-or-nothing allocation and write-out are problematic when available
      space in all the block groups are get tight with the active zone
      restriction. btrfs_reserve_extent() try hard to utilize the left space in
      the active block groups and gives up finally and fails with
      -ENOSPC. However, if we send IOs for the successfully allocated region, we
      can finish a zone and can continue on the rest of the allocation on a newly
      allocated block group.
      
      This patch implements the partial write-out for run_delalloc_zoned(). With
      this patch applied, cow_file_range() returns -EAGAIN to tell the caller to
      do something to progress the further allocation, and tells the successfully
      allocated region with done_offset. Furthermore, the zoned extent allocator
      returns -EAGAIN to tell cow_file_range() going back to the caller side.
      
      Actually, we still need to wait for an IO to complete to continue the
      allocation. The next patch implements that part.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      898793d9
    • Naohiro Aota's avatar
      btrfs: zoned: activate necessary block group · b6a98021
      Naohiro Aota authored
      There are two places where allocating a chunk is not enough. These two
      places are trying to ensure the space by allocating a chunk. To meet the
      condition for active_total_bytes, we also need to activate a block group
      there.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b6a98021
    • Naohiro Aota's avatar
      btrfs: zoned: activate metadata block group on flush_space · b0931513
      Naohiro Aota authored
      For metadata space on zoned filesystem, reaching ALLOC_CHUNK{,_FORCE}
      means we don't have enough space left in the active_total_bytes. Before
      allocating a new chunk, we can try to activate an existing block group
      in this case.
      
      Also, allocating a chunk is not enough to grant a ticket for metadata
      space on zoned filesystem we need to activate the block group to
      increase the active_total_bytes.
      
      btrfs_zoned_activate_one_bg() implements the activation feature. It will
      activate a block group by (maybe) finishing a block group. It will give up
      activating a block group if it cannot finish any block group.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b0931513
    • Naohiro Aota's avatar
      btrfs: zoned: disable metadata overcommit for zoned · 79417d04
      Naohiro Aota authored
      The metadata overcommit makes the space reservation flexible but it is also
      harmful to active zone tracking. Since we cannot finish a block group from
      the metadata allocation context, we might not activate a new block group
      and might not be able to actually write out the overcommit reservations.
      
      So, disable metadata overcommit for zoned filesystems. We will ensure
      the reservations are under active_total_bytes in the following patches.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      79417d04
    • Naohiro Aota's avatar
      btrfs: zoned: introduce space_info->active_total_bytes · 6a921de5
      Naohiro Aota authored
      The active_total_bytes, like the total_bytes, accounts for the total bytes
      of active block groups in the space_info.
      
      With an introduction of active_total_bytes, we can check if the reserved
      bytes can be written to the block groups without activating a new block
      group. The check is necessary for metadata allocation on zoned
      filesystem. We cannot finish a block group, which may require waiting
      for the current transaction, from the metadata allocation context.
      Instead, we need to ensure the ongoing allocation (reserved bytes) fits
      in active block groups.
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6a921de5
    • Naohiro Aota's avatar
      btrfs: zoned: finish least available block group on data bg allocation · 393f646e
      Naohiro Aota authored
      When we run out of active zones and no sufficient space is left in any
      block groups, we need to finish one block group to make room to activate a
      new block group.
      
      However, we cannot do this for metadata block groups because we can cause a
      deadlock by waiting for a running transaction commit. So, do that only for
      a data block group.
      
      Furthermore, the block group to be finished has two requirements. First,
      the block group must not have reserved bytes left. Having reserved bytes
      means we have an allocated region but did not yet send bios for it. If that
      region is allocated by the thread calling btrfs_zone_finish(), it results
      in a deadlock.
      
      Second, the block group to be finished must not be a SYSTEM block
      group. Finishing a SYSTEM block group easily breaks further chunk
      allocation by nullifying the SYSTEM free space.
      
      In a certain case, we cannot find any zone finish candidate or
      btrfs_zone_finish() may fail. In that case, we fall back to split the
      allocation bytes and fill the last spaces left in the block groups.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      393f646e
    • Naohiro Aota's avatar
      btrfs: let can_allocate_chunk return error · bb9950d3
      Naohiro Aota authored
      For the later patch, convert the return type from bool to int and return
      errors. No functional changes.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb9950d3
    • Naohiro Aota's avatar
      btrfs: use fs_info->max_extent_size in get_extent_max_capacity() · d7601566
      Naohiro Aota authored
      Use fs_info->max_extent_size also in get_extent_max_capacity() for the
      completeness. This is only used for defrag and not really necessary to fix
      the metadata reservation size. But, it still suppresses unnecessary defrag
      operations.
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d7601566
    • Naohiro Aota's avatar
      btrfs: convert count_max_extents() to use fs_info->max_extent_size · 7d7672bc
      Naohiro Aota authored
      If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number
      of extents needed, btrfs release the metadata reservation too much on its
      way to write out the data.
      
      Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
      convert count_max_extents() to use it instead, and fix the calculation of
      the metadata reservation.
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7d7672bc
    • Naohiro Aota's avatar
      btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size · f7b12a62
      Naohiro Aota authored
      On zoned filesystem, data write out is limited by max_zone_append_size,
      and a large ordered extent is split according the size of a bio. OTOH,
      the number of extents to be written is calculated using
      BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
      metadata bytes to update and/or create the metadata items.
      
      The metadata reservation is done at e.g, btrfs_buffered_write() and then
      released according to the estimation changes. Thus, if the number of extent
      increases massively, the reserved metadata can run out.
      
      The increase of the number of extents easily occurs on zoned filesystem
      if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the
      following warning on a small RAM environment with disabling metadata
      over-commit (in the following patch).
      
      [75721.498492] ------------[ cut here ]------------
      [75721.505624] BTRFS: block rsv 1 returned -28
      [75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G        W         5.18.0-rc2-BTRFS-ZNS+ #109
      [75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
      [75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
      [75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
      [75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
      [75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
      [75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
      [75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
      [75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
      [75721.701878] FS:  0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
      [75721.712601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
      [75721.730499] Call Trace:
      [75721.735166]  <TASK>
      [75721.739886]  btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
      [75721.747545]  ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
      [75721.756145]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.762852]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.769520]  ? push_leaf_left+0x420/0x620 [btrfs]
      [75721.776431]  ? memcpy+0x4e/0x60
      [75721.781931]  split_leaf+0x433/0x12d0 [btrfs]
      [75721.788392]  ? btrfs_get_token_32+0x580/0x580 [btrfs]
      [75721.795636]  ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
      [75721.803759]  ? leaf_space_used+0x15d/0x1a0 [btrfs]
      [75721.811156]  btrfs_search_slot+0x1bc3/0x2790 [btrfs]
      [75721.818300]  ? lock_downgrade+0x7c0/0x7c0
      [75721.824411]  ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
      [75721.832456]  ? split_leaf+0x12d0/0x12d0 [btrfs]
      [75721.839149]  ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
      [75721.846945]  ? free_extent_buffer+0x13/0x20 [btrfs]
      [75721.853960]  ? btrfs_release_path+0x4b/0x190 [btrfs]
      [75721.861429]  btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
      [75721.869313]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.876085]  ? lock_release+0x552/0xf80
      [75721.881957]  ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
      [75721.888886]  ? __kasan_check_write+0x14/0x20
      [75721.895152]  ? do_raw_read_unlock+0x44/0x80
      [75721.901323]  ? _raw_write_lock_irq+0x60/0x80
      [75721.907983]  ? btrfs_global_root+0xb9/0xe0 [btrfs]
      [75721.915166]  ? btrfs_csum_root+0x12b/0x180 [btrfs]
      [75721.921918]  ? btrfs_get_global_root+0x820/0x820 [btrfs]
      [75721.929166]  ? _raw_write_unlock+0x23/0x40
      [75721.935116]  ? unpin_extent_cache+0x1e3/0x390 [btrfs]
      [75721.942041]  btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
      [75721.949906]  ? try_to_wake_up+0x30/0x14a0
      [75721.955700]  ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
      [75721.962661]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.969111]  ? lock_acquire+0x41b/0x4c0
      [75721.974982]  finish_ordered_fn+0x15/0x20 [btrfs]
      [75721.981639]  btrfs_work_helper+0x1af/0xa80 [btrfs]
      [75721.988184]  ? _raw_spin_unlock_irq+0x28/0x50
      [75721.994643]  process_one_work+0x815/0x1460
      [75722.000444]  ? pwq_dec_nr_in_flight+0x250/0x250
      [75722.006643]  ? do_raw_spin_trylock+0xbb/0x190
      [75722.013086]  worker_thread+0x59a/0xeb0
      [75722.018511]  kthread+0x2ac/0x360
      [75722.023428]  ? process_one_work+0x1460/0x1460
      [75722.029431]  ? kthread_complete_and_exit+0x30/0x30
      [75722.036044]  ret_from_fork+0x22/0x30
      [75722.041255]  </TASK>
      [75722.045047] irq event stamp: 0
      [75722.049703] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
      [75722.067533] softirqs last  enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
      [75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [75722.085335] ---[ end trace 0000000000000000 ]---
      
      To fix the estimation, we need to introduce fs_info->max_extent_size to
      replace BTRFS_MAX_EXTENT_SIZE, which allow setting the different size for
      regular vs zoned filesystem.
      
      Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
      filesystem, it is set to fs_info->max_zone_append_size.
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f7b12a62
    • Naohiro Aota's avatar
      btrfs: zoned: revive max_zone_append_bytes · c2ae7b77
      Naohiro Aota authored
      This patch is basically a revert of commit 5a80d1c6 ("btrfs: zoned:
      remove max_zone_append_size logic"), but without unnecessary ASSERT and
      check. The max_zone_append_size will be used as a hint to estimate the
      number of extents to cover delalloc/writeback region in the later commits.
      
      The size of a ZONE APPEND bio is also limited by queue_max_segments(), so
      this commit considers it to calculate max_zone_append_size. Technically, a
      bio can be larger than queue_max_segments() * PAGE_SIZE if the pages are
      contiguous. But, it is safe to consider "queue_max_segments() * PAGE_SIZE"
      as an upper limit of an extent size to calculate the number of extents
      needed to write data.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c2ae7b77
    • Naohiro Aota's avatar
      block: add bdev_max_segments() helper · 65ea1b66
      Naohiro Aota authored
      Add bdev_max_segments() like other queue parameters.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJens Axboe <axboe@kernel.dk>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      65ea1b66
    • Filipe Manana's avatar
      btrfs: add optimized btrfs_ino() version for 64 bits systems · cf2404a9
      Filipe Manana authored
      Currently btrfs_ino() tries to use first the objectid of the inode's
      location key. This is to avoid truncation of the inode number on 32 bits
      platforms because the i_ino field of struct inode has the unsigned long
      type, while the objectid is a 64 bits unsigned type (u64) on every system.
      This logic was added in commit 33345d01 ("Btrfs: Always use 64bit
      inode number").
      
      However if we are running on a 64 bits system, we can always directly
      return the i_ino value from struct inode, which eliminates the need for
      he special if statement that tests for a location key type of
      BTRFS_ROOT_ITEM_KEY - in which case i_ino may not have the same value as
      the objectid in the inode's location objectid, it may have a value of
      BTRFS_EMPTY_SUBVOL_DIR_OBJECTID, for the case of snapshots of trees with
      subvolumes/snapshots inside them.
      
      So add a special version for 64 bits system that directly returns i_ino
      of struct inode. This eliminates one branch and reduces the overall code
      size, since btrfs_ino() is an inline function that is extensively used.
      
      Before:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1617487	 189240	  29032	1835759	 1c02ef	fs/btrfs/btrfs.ko
      
      After:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1612028	 189180	  29032	1830240	 1bed60	fs/btrfs/btrfs.ko
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cf2404a9
    • Filipe Manana's avatar
      btrfs: set the objectid of the btree inode's location key · adac5584
      Filipe Manana authored
      We currently don't use the location key of the btree inode, its content
      is set to zeroes, as it's a special inode that is not persisted (it has
      no inode item stored in any btree).
      
      At btrfs_ino(), an inline function used extensively in btrfs, we have
      this special check if the given inode's location objectid is 0, and if it
      is, we return the value stored in the VFS' inode i_ino field instead
      (which is BTRFS_BTREE_INODE_OBJECTID for the btree inode).
      
      To reduce the code at btrfs_ino(), we can simply set the objectid of the
      btree inode to the value BTRFS_BTREE_INODE_OBJECTID. This eliminates the
      need to check for the special case of the objectid being zero, with the
      side effect of reducing the overall code size and having less code to
      execute, as btrfs_ino() is an inline function.
      
      Before:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1620502	 189240	  29032	1838774	 1c0eb6	fs/btrfs/btrfs.ko
      
      After:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1617487	 189240	  29032	1835759	 1c02ef	fs/btrfs/btrfs.ko
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      adac5584
    • Fabio M. De Francesco's avatar
      btrfs: replace kmap_atomic() with kmap_local_page() · 4cb2e5e8
      Fabio M. De Francesco authored
      kmap_atomic() is being deprecated in favor of kmap_local_page() where it
      is feasible. With kmap_local_page() mappings are per thread, CPU local,
      and not globally visible.
      
      The last use of kmap_atomic is in inode.c where the context is atomic [1]
      and can be safely replaced by kmap_local_page.
      
      Tested with xfstests on a QEMU + KVM 32-bits VM with 4GB RAM and booting a
      kernel with HIGHMEM64GB enabled.
      
      [1] https://lore.kernel.org/linux-btrfs/20220601132545.GM20633@twin.jikos.cz/Suggested-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Signed-off-by: default avatarFabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4cb2e5e8
    • Fabio M. De Francesco's avatar
      btrfs: zlib: replace kmap() with kmap_local_page() in zlib_decompress_bio() · 5a6e6e7c
      Fabio M. De Francesco authored
      The use of kmap() is being deprecated in favor of kmap_local_page(). With
      kmap_local_page(), the mapping is per thread, CPU local and not globally
      visible.
      
      Therefore, use kmap_local_page() / kunmap_local() in zlib_decompress_bio()
      because in this function the mappings are per thread and are not visible
      in other contexts.
      
      Tested with xfstests on QEMU + KVM 32-bits VM with 4GB of RAM and
      HIGHMEM64G enabled. This patch passes 26/26 tests of group "compress".
      Suggested-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5a6e6e7c
    • Fabio M. De Francesco's avatar
      btrfs: zlib: replace kmap() with kmap_local_page() in zlib_compress_pages() · 718e5855
      Fabio M. De Francesco authored
      The use of kmap() is being deprecated in favor of kmap_local_page(). With
      kmap_local_page(), the mapping is per thread, CPU local and not globally
      visible.
      
      Therefore, use kmap_local_page() / kunmap_local() in zlib_compress_pages()
      because in this function the mappings are per thread and are not visible
      in other contexts. Furthermore, drop the mappings of "out_page" which is
      allocated within zlib_compress_pages() with alloc_page(GFP_NOFS) and use
      page_address().
      
      Tested with xfstests on a QEMU + KVM 32-bits VM with 4GB of RAM booting
      a kernel with HIGHMEM64G enabled. This patch passes 26/26 tests of group
      "compress".
      
      CC: Qu Wenruo <wqu@suse.com>
      Suggested-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Signed-off-by: default avatarFabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      718e5855
    • Fabio M. De Francesco's avatar
      btrfs: zstd: replace kmap() with kmap_local_page() · ebd23482
      Fabio M. De Francesco authored
      The use of kmap() is being deprecated in favor of kmap_local_page(). With
      kmap_local_page(), the mapping is per thread, CPU local and not globally
      visible.
      
      Therefore, use kmap_local_page() / kunmap_local() in zstd.c because in this
      file the mappings are per thread and are not visible in other contexts. In
      the meanwhile use plain page_address() on output pages allocated with
      the GFP_NOFS flag instead of calling kmap*() on them (since they are
      always allocated from ZONE_NORMAL).
      
      Tested with xfstests on QEMU + KVM 32 bits VM with 4GB of RAM, booting a
      kernel with HIGHMEM64G enabled.
      Suggested-by: default avatarIra Weiny <ira.weiny@intel.com>
      Signed-off-by: default avatarFabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ebd23482
    • Fabio M. De Francesco's avatar
      highmem: Make __kunmap_{local,atomic}() take const void pointer · 39ade048
      Fabio M. De Francesco authored
      __kunmap_ {local,atomic}() currently take pointers to void. However, this
      is semantically incorrect, since these functions do not change the memory
      their arguments point to.
      
      Therefore, make this semantics explicit by modifying the
      __kunmap_{local,atomic}() prototypes to take pointers to const void.
      
      As a side effect, compilers may produce more efficient code.
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: Helge Deller <deller@gmx.de>  # parisc
      Suggested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Suggested-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Signed-off-by: default avatarFabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      39ade048
    • Filipe Manana's avatar
      btrfs: don't fallback to buffered IO for NOWAIT direct IO writes · ac5e6669
      Filipe Manana authored
      Currently, for a direct IO write, if we need to fallback to buffered IO,
      either to satisfy the whole write operation or just a part of it, we do
      it in the current context even if it's a NOWAIT context. This is not ideal
      because we currently don't have support for NOWAIT semantics in the
      buffered IO path (we can block for several reasons), so we should instead
      return -EAGAIN to the caller, so that it knows it should retry (the whole
      operation or what's left of it) in a context where blocking is acceptable.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac5e6669
    • David Sterba's avatar
      btrfs: use enum for btrfs_block_rsv::type · 8bfc9b2c
      David Sterba authored
      The number of block group reserve types BTRFS_BLOCK_RSV_* is small and
      fits to u8 and there's enough left in case we want to add more.
      For type safety use the enum but make it 8 bits in the structure to save
      space.
      
      The structure size is now 48 on release build, making a slight
      improvement in structures where it's embedded, like btrfs_fs_info or
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8bfc9b2c
    • David Sterba's avatar
      btrfs: switch btrfs_block_rsv::failfast to bool · 710d5921
      David Sterba authored
      Use simple bool type for the block reserve failfast status, there's
      short to save space as there used to be int but there's no reason for
      that.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      710d5921
    • David Sterba's avatar
      btrfs: switch btrfs_block_rsv::full to bool · c70c2c5b
      David Sterba authored
      Use simple bool type for the block reserve full status, there's short to
      save space as there used to be int but there's no reason for that.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c70c2c5b
    • Christoph Hellwig's avatar
      btrfs: do not return errors from btrfs_submit_dio_bio · 37899117
      Christoph Hellwig authored
      Always consume the bio and call the end_io handler on error instead of
      returning an error and letting the caller handle it.  This matches what
      the block layer submission and the other btrfs bio submission handlers do
      and avoids any confusion on who needs to handle errors.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      37899117
    • Christoph Hellwig's avatar
      btrfs: handle allocation failure in btrfs_wq_submit_bio gracefully · ea1f0ced
      Christoph Hellwig authored
      btrfs_wq_submit_bio is used for writeback under memory pressure.
      Instead of failing the I/O when we can't allocate the async_submit_bio,
      just punt back to the synchronous submission path.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ea1f0ced
    • Christoph Hellwig's avatar
      btrfs: simplify sync/async submission in btrfs_submit_data_write_bio · 82443fd5
      Christoph Hellwig authored
      btrfs_submit_data_write_bio special cases the reloc root because the
      checksums are preloaded, but only does so for the !sync case.  The sync
      case can't happen for data relocation, but just handling it more generally
      significantly simplifies the logic.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      82443fd5
    • Christoph Hellwig's avatar
      btrfs: raid56: transfer the bio counter reference to the raid submission helpers · b9af128d
      Christoph Hellwig authored
      Transfer the bio counter reference acquired by btrfs_submit_bio to
      raid56_parity_write and raid56_parity_recovery together with the bio
      that the reference was acquired for instead of acquiring another
      reference in those helpers and dropping the original one in
      btrfs_submit_bio.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b9af128d