1. 25 Jul, 2022 40 commits
    • Qu Wenruo's avatar
      btrfs: raid56: avoid double for loop inside finish_rmw() · 36920044
      Qu Wenruo authored
      We can easily calculate the stripe number and sector number inside the
      stripe.  Thus there is not much need for a double for loop.
      
      For the only case we want to skip the whole stripe, we can manually
      increase @total_sector_nr.
      This is not a recommended behavior, thus every time the iterator gets
      modified there will be a comment along with an ASSERT() for it.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      36920044
    • Josef Bacik's avatar
      btrfs: tree-log: make the return value for log syncing consistent · f31f09f6
      Josef Bacik authored
      Currently we will return 1 or -EAGAIN if we decide we need to commit
      the transaction rather than sync the log.  In practice this doesn't
      really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to
      commit the transaction.  However this makes it hard to figure out what
      the correct thing to do is.
      
      Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the
      places where we want to force the transaction to be committed.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f31f09f6
    • Johannes Thumshirn's avatar
      btrfs: add tracepoints for ordered extents · 5bea2508
      Johannes Thumshirn authored
      When debugging a reference counting issue with ordered extents, I've found
      we're lacking a lot of tracepoint coverage in the ordered extent code.
      
      Close these gaps by adding tracepoints after every refcount_inc() in the
      ordered extent code.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5bea2508
    • David Sterba's avatar
      btrfs: sysfs: advertise zoned support among features · 15dcccdb
      David Sterba authored
      We've hidden the zoned support in sysfs under debug config for the first
      releases but now the stability is reasonable, though not all features
      have been implemented.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      15dcccdb
    • Christoph Hellwig's avatar
      btrfs: split discard handling out of btrfs_map_block · a4012f06
      Christoph Hellwig authored
      Mapping block for discard doesn't really share any code with the regular
      block mapping case.  Split it out into an entirely separate helper
      that just returns an array of btrfs_discard_stripe structures and the
      number of stripes.
      
      This removes the need for the length field in the btrfs_io_context
      structure, so remove tht.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a4012f06
    • Christoph Hellwig's avatar
      btrfs: stop looking at btrfs_bio->iter in index_one_bio · 5eecef71
      Christoph Hellwig authored
      All the bios that index_one_bio operates on are the bios submitted by the
      upper layer.  These are never resubmitted to an actual device by the
      raid56 code, and thus the iter never changes from the initial state.
      Thus we can always just use bi_iter directly as it will be the same as
      the saved copy.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5eecef71
    • Qu Wenruo's avatar
      btrfs: reject log replay if there is unsupported RO compat flag · dc4d3168
      Qu Wenruo authored
      [BUG]
      If we have a btrfs image with dirty log, along with an unsupported RO
      compatible flag:
      
      log_root		30474240
      ...
      compat_flags		0x0
      compat_ro_flags		0x40000003
      			( FREE_SPACE_TREE |
      			  FREE_SPACE_TREE_VALID |
      			  unknown flag: 0x40000000 )
      
      Then even if we can only mount it RO, we will still cause metadata
      update for log replay:
      
        BTRFS info (device dm-1): flagging fs with big metadata feature
        BTRFS info (device dm-1): using free space tree
        BTRFS info (device dm-1): has skinny extents
        BTRFS info (device dm-1): start tree-log replay
      
      This is definitely against RO compact flag requirement.
      
      [CAUSE]
      RO compact flag only forces us to do RO mount, but we will still do log
      replay for plain RO mount.
      
      Thus this will result us to do log replay and update metadata.
      
      This can be very problematic for new RO compat flag, for example older
      kernel can not understand v2 cache, and if we allow metadata update on
      RO mount and invalidate/corrupt v2 cache.
      
      [FIX]
      Just reject the mount unless rescue=nologreplay is provided:
      
        BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead
      
      We don't want to set rescue=nologreply directly, as this would make the
      end user to read the old data, and cause confusion.
      
      Since the such case is really rare, we're mostly fine to just reject the
      mount with an error message, which also includes the proper workaround.
      
      CC: stable@vger.kernel.org #4.9+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dc4d3168
    • Qu Wenruo's avatar
      btrfs: make btrfs_super_block::log_root_transid deprecated · 97f09d55
      Qu Wenruo authored
      When using "btrfs inspect-internal dump-super" to inspect an fs with
      dirty log, it always shows the log_root_transid as 0:
      
        log_root                30474240
        log_root_transid        0 <<<
        log_root_level          0
      
      It turns out that, btrfs_super_block::log_root_transid is never really
      utilized (even no read for it).
      
      This can date back to the introduction of btrfs into upstream kernel.
      
      In fact, when reading log tree root, we always use
      btrfs_super_block::generation + 1 as the expected generation.
      So here we're completely safe to mark this member deprecated.
      
      In theory we can easily reuse this member for other purposes, but to be
      extra safe, here we follow the leafsize way, by adding "__unused_" for
      log_root_transid.
      And we can safely remove the accessors, since there is no such callers
      from the very beginning.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      97f09d55
    • Christoph Hellwig's avatar
      btrfs: pass the btrfs_bio_ctrl to submit_one_bio · 722c82ac
      Christoph Hellwig authored
      submit_one_bio always works on the bio and compression flags from a
      btrfs_bio_ctrl structure.  Pass the explicitly and clean up the
      calling conventions by handling a NULL bio in submit_one_bio, and
      using the btrfs_bio_ctrl to pass the mirror number as well.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      722c82ac
    • Christoph Hellwig's avatar
      btrfs: merge end_write_bio and flush_write_bio · 9845e5dd
      Christoph Hellwig authored
      Merge end_write_bio and flush_write_bio into a single submit_write_bio
      helper, that either submits the bio or ends it if a negative errno was
      passed in.  This consolidates a lot of duplicated checks in the callers.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9845e5dd
    • Christoph Hellwig's avatar
      btrfs: don't use bio->bi_private to pass the inode to submit_one_bio · 2d5ac130
      Christoph Hellwig authored
      submit_one_bio is only used for page cache I/O, so the inode can be
      trivially derived from the first page in the bio.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2d5ac130
    • David Sterba's avatar
      btrfs: remove redundant check in up check_setget_bounds · 234fdd28
      David Sterba authored
      There are two separate checks in the bounds checker, the first one being
      a special case of the second. As this function is performance critical
      due to checking access to any eb member, reducing the size can slightly
      improve performance.
      
      On a release build on x86_64 the helper is completely inlined so the
      function call overhead is also gone.
      
      There was a report of 5% performance drop on metadata heavy workload,
      that disappeared after disabling asserts. The most significant part of
      that is the bounds checker.
      
      https://lore.kernel.org/linux-btrfs/20200724164147.39925-1-josef@toxicpanda.com/
      
      After the analysis, the optimized code removes the worst overhead which
      is the function call and the performance was restored.
      
      https://lore.kernel.org/linux-btrfs/20200730110943.GE3703@twin.jikos.cz/
      
      1. baseline, asserts on, setget check on
      
      run time:		46s
      run time with perf:	48s
      
      2. asserts on, comment out setget check
      
      run time:		44s
      run time with perf:	47s
      
      So this is confirms the 5% difference
      
      3. asserts on, optimized seget check
      
      run time:		44s
      run time with perf:	47s
      
      The optimizations are reducing the number of ifs to 1 and inlining the
      hot path. Low-level stuff, gets the performance back. Patch below.
      
      4. asserts off, no setget check
      
      run time:		44s
      run time with perf:	45s
      
      This verifies that asserts other than the setget check have negligible
      impact on performance and it's not harmful to keep them on.
      
      Analysis where the performance is lost:
      
      * check_setget_bounds is short function, but it's still a function call,
        changing the flow of instructions and given how many times it's
        called the overhead adds up
      
      * there are two conditions, one to check if the range is
        completely outside (member_offset > eb->len) or partially inside
        (member_offset + size > eb->len)
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      234fdd28
    • Fabio M. De Francesco's avatar
      btrfs: replace kmap() with kmap_local_page() in lzo.c · 51c0674a
      Fabio M. De Francesco authored
      The use of kmap() is being deprecated in favor of kmap_local_page() where
      it is feasible. With kmap_local_page(), the mapping is per thread, CPU
      local and not globally visible.
      
      Therefore, use kmap_local_page() / kunmap_local() in lzo.c wherever the
      mappings are per thread and not globally visible.
      
      Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled.
      Suggested-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarFabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      51c0674a
    • Fabio M. De Francesco's avatar
      btrfs: replace kmap() with kmap_local_page() in inode.c · 70826b6b
      Fabio M. De Francesco authored
      The use of kmap() is being deprecated in favor of kmap_local_page() where
      it is feasible. With kmap_local_page(), the mapping is per thread, CPU
      local and not globally visible.
      
      Therefore, use kmap_local_page() / kunmap_local() in inode.c wherever the
      mappings are per thread and not globally visible.
      
      Tested on QEMU + KVM 32 bits VM with 4GB of RAM and HIGHMEM64G enabled.
      Suggested-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarFabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      70826b6b
    • Christoph Hellwig's avatar
      btrfs: do not allocate a btrfs_bio for low-level bios · 9ff7ddd3
      Christoph Hellwig authored
      The bios submitted from btrfs_map_bio don't really interact with the
      rest of btrfs and the only btrfs_bio member actually used in the
      low-level bios is the pointer to the btrfs_io_context used for endio
      handler.
      
      Use a union in struct btrfs_io_stripe that allows the endio handler to
      find the btrfs_io_context and remove the spurious ->device assignment
      so that a plain fs_bio_set bio can be used for the low-level bios
      allocated inside btrfs_map_bio.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9ff7ddd3
    • Christoph Hellwig's avatar
      btrfs: factor stripe submission logic out of btrfs_map_bio · a316a259
      Christoph Hellwig authored
      Move all per-stripe handling into submit_stripe_bio and use a label to
      cleanup instead of duplicating the logic.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a316a259
    • Christoph Hellwig's avatar
      btrfs: remove btrfs_end_io_wq · d7b9416f
      Christoph Hellwig authored
      All reads bio that go through btrfs_map_bio need to be completed in
      user context.  And read I/Os are the most common and timing critical
      in almost any file system workloads.
      
      Embed a work_struct into struct btrfs_bio and use it to complete all
      read bios submitted through btrfs_map, using the REQ_META flag to decide
      which workqueue they are placed on.
      
      This removes the need for a separate 128 byte allocation (typically
      rounded up to 192 bytes by slab) for all reads with a size increase
      of 24 bytes for struct btrfs_bio.  Future patches will reorganize
      struct btrfs_bio to make use of this extra space for writes as well.
      
      (All sizes are based a on typical 64-bit non-debug build)
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d7b9416f
    • Christoph Hellwig's avatar
      btrfs: centralize setting REQ_META · 08a6f464
      Christoph Hellwig authored
      Set REQ_META in btrfs_submit_metadata_bio instead of the various callers.
      We'll start relying on this flag inside of btrfs in a bit, and this
      ensures it is always set correctly.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      08a6f464
    • Christoph Hellwig's avatar
      btrfs: don't use btrfs_bio_wq_end_io for compressed writes · fed8a72d
      Christoph Hellwig authored
      Compressed write bio completion is the only user of btrfs_bio_wq_end_io
      for writes, and the use of btrfs_bio_wq_end_io is a little suboptimal
      here as we only real need user context for the final completion of a
      compressed_bio structure, and not every single bio completion.
      
      Add a work_struct to struct compressed_bio instead and use that to call
      finish_compressed_bio_write.  This allows to remove all handling of
      write bios in the btrfs_bio_wq_end_io infrastructure.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fed8a72d
    • Christoph Hellwig's avatar
      btrfs: don't double-defer bio completions for compressed reads · 02bb5b72
      Christoph Hellwig authored
      The bio completion handler of the bio used for the compressed data is
      already run in a workqueue using btrfs_bio_wq_end_io, so don't schedule
      the completion of the original bio to the same workqueue again but just
      execute it directly.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      02bb5b72
    • Christoph Hellwig's avatar
      btrfs: defer I/O completion based on the btrfs_raid_bio · d34e123d
      Christoph Hellwig authored
      Instead of attaching an extra allocation an indirect call to each
      low-level bio issued by the RAID code, add a work_struct to struct
      btrfs_raid_bio and only defer the per-rbio completion action.  The
      per-bio action for all the I/Os are trivial and can be safely done
      from interrupt context.
      
      As a nice side effect this also allows sharing the boilerplate code
      for the per-bio completions
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d34e123d
    • Christoph Hellwig's avatar
      btrfs: split btrfs_submit_data_bio to read and write parts · c93104e7
      Christoph Hellwig authored
      Split btrfs_submit_data_bio into one helper for reads and one for writes.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c93104e7
    • Christoph Hellwig's avatar
      btrfs: simplify code flow in btrfs_submit_dio_bio · e6484bd4
      Christoph Hellwig authored
      There is no exit block and cleanup and the function is reasonably short
      so we can use inline return and not the goto. This makes the function
      more straight forward.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e6484bd4
    • Christoph Hellwig's avatar
      btrfs: move more work into btrfs_end_bioc · b4c46bde
      Christoph Hellwig authored
      Assign ->mirror_num and ->bi_status in btrfs_end_bioc instead of
      duplicating the logic in the callers.  Also remove the bio argument as
      it always must be bioc->orig_bio and the now pointless bioc_error that
      did nothing but assign bi_sector to the same value just sampled in the
      caller.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b4c46bde
    • Omar Sandoval's avatar
      btrfs: send: enable support for stream v2 and compressed writes · d6815592
      Omar Sandoval authored
      Now that the new support is implemented, allow the ioctl to accept v2
      and the compressed flag, and update the version in sysfs.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d6815592
    • Omar Sandoval's avatar
      btrfs: send: send compressed extents with encoded writes · 3ea4dc5b
      Omar Sandoval authored
      Now that all of the pieces are in place, we can use the ENCODED_WRITE
      command to send compressed extents when appropriate.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3ea4dc5b
    • Omar Sandoval's avatar
      btrfs: send: get send buffer pages for protocol v2 · a4b333f2
      Omar Sandoval authored
      For encoded writes in send v2, we will get the encoded data with
      btrfs_encoded_read_regular_fill_pages(), which expects a list of raw
      pages. To avoid extra buffers and copies, we should read directly into
      the send buffer. Therefore, we need the raw pages for the send buffer.
      
      We currently allocate the send buffer with kvmalloc(), which may return
      a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the
      pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page().
      However, the buffer size we use (144K) is not a power of two, which in
      theory is not guaranteed to return a page-aligned buffer, and in
      practice would waste a lot of memory due to rounding up to the next
      power of two. 144K is large enough that it usually gets allocated with
      vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc()
      and save the pages in an array.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a4b333f2
    • Omar Sandoval's avatar
      btrfs: send: write larger chunks when using stream v2 · 356bbbb6
      Omar Sandoval authored
      The length field of the send stream TLV header is 16 bits. This means
      that the maximum amount of data that can be sent for one write is 64K
      minus one. However, encoded writes must be able to send the maximum
      compressed extent (128K) in one command, or more. To support this, send
      stream version 2 encodes the DATA attribute differently: it has no
      length field, and the length is implicitly up to the end of containing
      command (which has a 32bit length field). Although this is necessary
      for encoded writes, normal writes can benefit from it, too.
      
      Also add a check to enforce that the DATA attribute is last. It is only
      strictly necessary for v2, but we might as well make v1 consistent with
      it.
      
      For v2, let's bump up the send buffer to the maximum compressed extent
      size plus 16K for the other metadata (144K total). Since this will most
      likely be vmalloc'd (and always will be after the next commit), we round
      it up to the next page since we might as well use the rest of the page
      on systems with >16K pages.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      356bbbb6
    • Omar Sandoval's avatar
      btrfs: send: add stream v2 definitions · b7c14f23
      Omar Sandoval authored
      This adds the definitions of the new commands for send stream version 2
      and their respective attributes: fallocate, FS_IOC_SETFLAGS (a.k.a.
      chattr), and encoded writes. It also documents two changes to the send
      stream format in v2: the receiver shouldn't assume a maximum command
      size, and the DATA attribute is encoded differently to allow for writes
      larger than 64k. These will be implemented in subsequent changes, and
      then the ioctl will accept the new version and flag.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b7c14f23
    • Omar Sandoval's avatar
      btrfs: send: explicitly number commands and attributes · 54cab6af
      Omar Sandoval authored
      Commit e77fbf99 ("btrfs: send: prepare for v2 protocol") added
      _BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the
      version plus 1, but as written this creates gaps in the number space.
      
      The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is
      accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2
      has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that
      23 and 24 are valid commands.
      
      Instead, let's explicitly number all of the commands, attributes, and
      sentinel MAX constants.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54cab6af
    • Omar Sandoval's avatar
      btrfs: send: remove unused send_ctx::{total,cmd}_send_size · ca182acc
      Omar Sandoval authored
      We collect these statistics but have never exposed them in any way. I
      also didn't find any patches that ever attempted to make use of them.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca182acc
    • Stefan Roesch's avatar
      btrfs: sysfs: add force_chunk_alloc trigger to force allocation · 22c55e3b
      Stefan Roesch authored
      Adds write-only trigger to force new chunk allocation for a given block
      group type. It is at
      
        /sys/fs/btrfs/<uuid>/allocation/<type>/force_chunk_alloc
      
      Note: this is now only for debugging and testing and is enabled with the
            CONFIG_BTRFS_DEBUG configuration option. The transaction is
            started from sysfs context and can be problematic in some cases.
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ Changes from the original submission:
        - update changelog
        - drop unnecessary error messages
        - switch value to bool and use kstrtobool
        - move BTRFS_ATTR_W definition
        - add comment for using transaction
      ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      22c55e3b
    • Stefan Roesch's avatar
      btrfs: sysfs: export chunk size in space infos · 19fc516a
      Stefan Roesch authored
      Add new sysfs knob
      
        /sys/fs/btrfs/<uuid>/allocation/<type>/chunk_size.
      
      This allows to query the chunk size and also set the chunk size.
      
      Constraints:
      
      - can be changed by root only
      - system chunk size can't be set
      - maximum chunk size is 10% of the filesystem size
      - final value is rounded down to a multiple of 256M
      - cannot be set on zoned filesystem
      
      Note, that rounding and the 10% clamp will result to a different value
      on filesystems smaller than 10G, typically 768M.
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ Changes to original submission:
        - document setting constraints
        - drop read-only requirement
        - drop unnecessary error messages
        - fix return values of _store callback
        - use memparse for the value
        - fix rounding down to 256M
      ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      19fc516a
    • Stefan Roesch's avatar
      btrfs: store chunk size in space-info struct · f6fca391
      Stefan Roesch authored
      The chunk size is stored in the btrfs_space_info structure.  It is
      initialized at the start and is then used.
      
      A new API is added to update the current chunk size.  This API is used
      to be able to expose the chunk_size as a sysfs setting.
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ rename and merge helpers, switch atomic type to u64, style fixes ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f6fca391
    • Josef Bacik's avatar
      btrfs: do not batch insert non-consecutive dir indexes during log replay · 71b68e9e
      Josef Bacik authored
      While running generic/475 in a loop I got the following error
      
      BTRFS critical (device dm-11): corrupt leaf: root=5 block=31096832 slot=69, bad key order, prev (263 96 531) current (263 96 524)
      <snip>
       item 65 key (263 96 517) itemoff 14132 itemsize 33
       item 66 key (263 96 523) itemoff 14099 itemsize 33
       item 67 key (263 96 525) itemoff 14066 itemsize 33
       item 68 key (263 96 531) itemoff 14033 itemsize 33
       item 69 key (263 96 524) itemoff 14000 itemsize 33
      
      As you can see here we have 3 dir index keys with the dir index value of
      523, 524, and 525 inserted between 517 and 524.  This occurs because our
      dir index insertion code will bulk insert all dir index items on the
      node regardless of their actual key value.
      
      This makes sense on a normally running system, because if there's a gap
      in between the items there was a deletion before the item was inserted,
      so there's not going to be an overlap of the dir index items that need
      to be inserted and what exists on disk.
      
      However during log replay this isn't necessarily true, we could have any
      number of dir indexes in the tree already.
      
      Fix this by seeing if we're replaying the log, and if we are simply skip
      batching if there's a gap in the key space.
      
      This file system was left broken from the fstest, I tested this patch
      against the broken fs to make sure it replayed the log properly, and
      then btrfs checked the file system after the log replay to verify
      everything was ok.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      71b68e9e
    • Filipe Manana's avatar
      btrfs: reduce amount of reserved metadata for delayed item insertion · 763748b2
      Filipe Manana authored
      Whenever we want to create a new dir index item (when creating an inode,
      create a hard link, rename a file) we reserve 1 unit of metadata space
      for it in a transaction (that's 256K for a node/leaf size of 16K), and
      then create a delayed insertion item for it to be added later to the
      subvolume's tree. That unit of metadata is kept until the delayed item
      is inserted into the subvolume tree, which may take a while to happen
      (in the worst case, it's done only when the transaction commits). If we
      have multiple dir index items to insert for the same directory, say N
      index items, and they all fit in a single leaf of metadata, then we are
      holding N units of reserved metadata space when all we need is 1 unit.
      
      This change addresses that, whenever a new delayed dir index item is
      added, we release the unit of metadata the caller has reserved when it
      started the transaction if adding that new dir index item does not
      result in touching one more metadata leaf, otherwise the reservation
      is kept by transferring it from the transaction block reserve to the
      delayed items block reserve, just like before. Given that with a leaf
      size of 16K we can have a few hundred dir index items in a single leaf
      (the exact value depends on file name lengths), this reduces pressure on
      metadata reservation by releasing unnecessary space much sooner.
      
      The following fs_mark test showed some improvement when creating many
      files in parallel on machine running a non debug kernel (debian's default
      kernel config) with 12 cores:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        FILES=100000
        THREADS=$(nproc --all)
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        OPTS="-S 0 -L 10 -n $FILES -s 0 -t $THREADS -k"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
      Before:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           2      1200000            0     225991.3          5465891
           4      2400000            0     345728.1          5512106
           4      3600000            0     346959.5          5557653
           8      4800000            0     329643.0          5587548
           8      6000000            0     312657.4          5606717
           8      7200000            0     281707.5          5727985
          12      8400000            0      88309.8          5020422
          12      9600000            0      85835.9          5207496
          16     10800000            0      81039.2          5404964
          16     12000000            0      58548.6          5842468
      
      After:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           2      1200000            0     230604.5          5778375
           4      2400000            0     348908.3          5508072
           4      3600000            0     357028.7          5484337
           6      4800000            0     342898.3          5565703
           6      6000000            0     314670.8          5751555
           8      7200000            0     282548.2          5778177
          12      8400000            0      90844.9          5306819
          12      9600000            0      86963.1          5304689
          16     10800000            0      89113.2          5455248
          16     12000000            0      86693.5          5518933
      
      The "after" results are after applying this patch and all the other
      patches in the same patchset, which is comprised of the following
      changes:
      
        btrfs: balance btree dirty pages and delayed items after a rename
        btrfs: free the path earlier when creating a new inode
        btrfs: balance btree dirty pages and delayed items after clone and dedupe
        btrfs: add assertions when deleting batches of delayed items
        btrfs: deal with deletion errors when deleting delayed items
        btrfs: refactor the delayed item deletion entry point
        btrfs: improve batch deletion of delayed dir index items
        btrfs: assert that delayed item is a dir index item when adding it
        btrfs: improve batch insertion of delayed dir index items
        btrfs: do not BUG_ON() on failure to reserve metadata for delayed item
        btrfs: set delayed item type when initializing it
        btrfs: reduce amount of reserved metadata for delayed item insertion
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      763748b2
    • Filipe Manana's avatar
      btrfs: set delayed item type when initializing it · c9d02ab4
      Filipe Manana authored
      Currently we set the type of a delayed item only after successfully
      inserting it into its respective rbtree. This is fine, as the type
      is not used anywhere before that point, but for the next patch in the
      series, there will be the need to check the type of a delayed item
      before inserting it into a rbtree.
      
      So set the type of a delayed item immediately after allocating it.
      This also makes the trivial wrappers for adding insertion and deletion
      useless, so it removes them as well.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c9d02ab4
    • Filipe Manana's avatar
      btrfs: do not BUG_ON() on failure to reserve metadata for delayed item · 3bae13e9
      Filipe Manana authored
      At btrfs_insert_delayed_dir_index(), we don't expect the metadata
      reservation for the delayed dir index item insertion to fail, because the
      caller is supposed to have reserved 1 unit of metadata space for that.
      All callers are able to deal with an error in case that happens, so there
      is no need for something so drastic as a BUG_ON() in case of failure.
      Instead just emit a warning, so that's easily noticed during development
      (fstests in particular), and return the error to the caller.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3bae13e9
    • Filipe Manana's avatar
      btrfs: improve batch insertion of delayed dir index items · 06ac264f
      Filipe Manana authored
      Currently we group delayed dir index items for insertion as a single batch
      (a single btree operation) as long as their keys are sequential in the key
      space.
      
      For example we have delayed index items for the following index keys:
      
         10, 11, 12, 15, 16, 20, 21
      
      We end up building three batches:
      
      1) First one for index keys 10, 11 and 12;
      2) Second one for index keys 15 and 16;
      3) Third one for index keys 20 and 21.
      
      However, since the dir index numbers come from a monotonically increasing
      counter and are never reused, we could group all these items into a single
      batch. The existence of holes in the sequence happens only when we had
      delayed dir index items for insertion that got deleted before they were
      flushed to the subvolume's tree.
      
      The delayed items are stored in a rbtree based on their key order, so
      we can just group items into a batch as long as they all fit in a leaf,
      and ignore if there's a gap (key offset, index number) between two
      consecutive items. This is more efficient and reduces the amount of
      time spent when running delayed items if there are gaps between dir
      index items.
      
      For example running the following test script:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        NUM_FILES=100
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
             echo -n > $MNT/testdir/file_$i
        done
      
        # Now delete every other file, to create gaps in the dir index keys.
        for ((i = 1; i <= $NUM_FILES; i += 2)); do
            rm -f $MNT/testdir/file_$i
        done
      
        start=$(date +%s%N)
        sync
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo -e "\nsync took $dur milliseconds"
      
        umount $MNT
      
      While having the following bpftrace script running in another shell:
      
        $ cat bpf-delayed-items-inserts.sh
        #!/usr/bin/bpftrace
      
        /* Must add 'noinline' to btrfs_insert_delayed_items(). */
        k:btrfs_insert_delayed_items
        {
            @start_insert_delayed_items[tid] = nsecs;
        }
      
        k:btrfs_insert_empty_items
        /@start_insert_delayed_items[tid]/
        {
           @insert_batches = count();
        }
      
        kr:btrfs_insert_delayed_items
        /@start_insert_delayed_items[tid]/
        {
            $dur = (nsecs - @start_insert_delayed_items[tid]) / 1000;
            @btrfs_insert_delayed_items_total_time = sum($dur);
            delete(@start_insert_delayed_items[tid]);
        }
      
      Before this change:
      
      @btrfs_insert_delayed_items_total_time: 576
      @insert_batches: 51
      
      After this change:
      
      @btrfs_insert_delayed_items_total_time: 174
      @insert_batches: 2
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      06ac264f
    • Filipe Manana's avatar
      btrfs: assert that delayed item is a dir index item when adding it · a176affe
      Filipe Manana authored
      All delayed items are for dir index items, we don't support any other item
      types at the moment. So simplify __btrfs_add_delayed_item() and add an
      assertion for checking the item's key type. This also allows the next
      change to be simpler and avoid to check key types. In case we add support
      for different item types in the future, then we'll hit the assertion
      during development and be able to adjust any code that is assuming delayed
      items are always associated to dir index items.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a176affe