1. 25 Jul, 2022 40 commits
    • btrfs: reject log replay if there is unsupported RO compat flag · dc4d3168
      Qu Wenruo authored
      [BUG]
      If we have a btrfs image with dirty log, along with an unsupported RO
      compatible flag:
      
      log_root		30474240
      ...
      compat_flags		0x0
      compat_ro_flags		0x40000003
      			( FREE_SPACE_TREE |
      			  FREE_SPACE_TREE_VALID |
      			  unknown flag: 0x40000000 )
      
      Then even though we can only mount it RO, log replay will still update
      metadata:
      
        BTRFS info (device dm-1): flagging fs with big metadata feature
        BTRFS info (device dm-1): using free space tree
        BTRFS info (device dm-1): has skinny extents
        BTRFS info (device dm-1): start tree-log replay
      
      This definitely violates the RO compat flag requirement.
      
      [CAUSE]
      An RO compat flag only forces us to do an RO mount, but we still do log
      replay for a plain RO mount.

      Thus we end up replaying the log and updating metadata.

      This can be very problematic for a new RO compat flag. For example, an
      older kernel cannot understand the v2 cache, so if we allow metadata
      updates on an RO mount we can invalidate or corrupt the v2 cache.
      
      [FIX]
      Just reject the mount unless rescue=nologreplay is provided:
      
        BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead
      
      We don't want to set rescue=nologreplay directly, as this would make the
      end user read old data and cause confusion.

      Since such a case is really rare, we're mostly fine to just reject the
      mount with an error message, which also includes the proper workaround.
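
The rule described in [FIX] can be sketched in userspace C. This is a minimal illustration, not the kernel code: the struct, the function name, the supported-flags mask and the -EINVAL return value are all assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative value only: the RO compat bits this sketch "understands". */
#define SKETCH_SUPPORTED_COMPAT_RO 0x3ULL

/* Hypothetical, simplified view of the superblock fields involved. */
struct sketch_super {
	uint64_t log_root;        /* non-zero means there is a dirty log */
	uint64_t compat_ro_flags;
};

/*
 * Return 0 if the mount may proceed, -EINVAL if it must be rejected.
 * nologreplay models the rescue=nologreplay mount option.
 */
static int sketch_check_log_replay(const struct sketch_super *sb,
				   bool nologreplay)
{
	uint64_t unsupported = sb->compat_ro_flags & ~SKETCH_SUPPORTED_COMPAT_RO;

	/* Unknown RO compat bits plus a dirty log would force metadata writes. */
	if (unsupported != 0 && sb->log_root != 0 && !nologreplay)
		return -EINVAL;
	return 0;
}
```

With the values from the dump above (log_root 30474240, compat_ro_flags 0x40000003) the sketch rejects the mount unless nologreplay is set.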
      
      CC: stable@vger.kernel.org #4.9+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_super_block::log_root_transid deprecated · 97f09d55
      Qu Wenruo authored
      When using "btrfs inspect-internal dump-super" to inspect an fs with
      dirty log, it always shows the log_root_transid as 0:
      
        log_root                30474240
        log_root_transid        0 <<<
        log_root_level          0
      
      It turns out that btrfs_super_block::log_root_transid is never really
      utilized (it is not even read).

      This dates back to the introduction of btrfs into the upstream kernel.
      
      In fact, when reading log tree root, we always use
      btrfs_super_block::generation + 1 as the expected generation.
      So here we're completely safe to mark this member deprecated.
      
      In theory we can easily reuse this member for other purposes, but to be
      extra safe, here we follow the leafsize way, by adding "__unused_" for
      log_root_transid.
      And we can safely remove the accessors, since there have been no callers
      from the very beginning.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass the btrfs_bio_ctrl to submit_one_bio · 722c82ac
      Christoph Hellwig authored
      submit_one_bio always works on the bio and compression flags from a
      btrfs_bio_ctrl structure.  Pass it explicitly and clean up the
      calling conventions by handling a NULL bio in submit_one_bio, and
      using the btrfs_bio_ctrl to pass the mirror number as well.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: merge end_write_bio and flush_write_bio · 9845e5dd
      Christoph Hellwig authored
      Merge end_write_bio and flush_write_bio into a single submit_write_bio
      helper, that either submits the bio or ends it if a negative errno was
      passed in.  This consolidates a lot of duplicated checks in the callers.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't use bio->bi_private to pass the inode to submit_one_bio · 2d5ac130
      Christoph Hellwig authored
      submit_one_bio is only used for page cache I/O, so the inode can be
      trivially derived from the first page in the bio.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant check in up check_setget_bounds · 234fdd28
      David Sterba authored
      There are two separate checks in the bounds checker, the first one being
      a special case of the second. As this function is performance critical
      due to checking access to any eb member, reducing the size can slightly
      improve performance.
      
      On a release build on x86_64 the helper is completely inlined so the
      function call overhead is also gone.
      
      There was a report of 5% performance drop on metadata heavy workload,
      that disappeared after disabling asserts. The most significant part of
      that is the bounds checker.
      
      https://lore.kernel.org/linux-btrfs/20200724164147.39925-1-josef@toxicpanda.com/
      
      After the analysis, the optimized code removes the worst overhead which
      is the function call and the performance was restored.
      
      https://lore.kernel.org/linux-btrfs/20200730110943.GE3703@twin.jikos.cz/
      
      1. baseline, asserts on, setget check on
      
      run time:		46s
      run time with perf:	48s
      
      2. asserts on, comment out setget check
      
      run time:		44s
      run time with perf:	47s
      
      So this confirms the 5% difference.

      3. asserts on, optimized setget check
      
      run time:		44s
      run time with perf:	47s
      
      The optimizations are reducing the number of ifs to 1 and inlining the
      hot path. Low-level stuff, gets the performance back. Patch below.
      
      4. asserts off, no setget check
      
      run time:		44s
      run time with perf:	45s
      
      This verifies that asserts other than the setget check have negligible
      impact on performance and it's not harmful to keep them on.
      
      Analysis where the performance is lost:
      
      * check_setget_bounds is a short function, but it's still a function call,
        changing the flow of instructions, and given how many times it's
        called the overhead adds up

      * there are two conditions, one to check if the range is
        completely outside (member_offset > eb->len) and one to check if it is
        partially inside (member_offset + size > eb->len)
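
The redundancy between the two conditions can be seen in a small standalone sketch. The function names are hypothetical and the code assumes member_offset + size does not overflow; it only illustrates why one comparison is enough, not how the kernel helper is written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Original shape: two comparisons. */
static bool in_bounds_two_checks(size_t member_offset, size_t size,
				 size_t eb_len)
{
	if (member_offset > eb_len)		/* completely outside */
		return false;
	if (member_offset + size > eb_len)	/* partially inside */
		return false;
	return true;
}

/*
 * Reduced shape: any offset that fails the first comparison also fails
 * the second, so a single comparison covers both cases.
 */
static inline bool in_bounds_one_check(size_t member_offset, size_t size,
				       size_t eb_len)
{
	return member_offset + size <= eb_len;
}
```

Both helpers agree for an 8-byte member at offsets 0, 16380 and 20000 in a 16384-byte buffer: inside, partially inside, and completely outside.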
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace kmap() with kmap_local_page() in lzo.c · 51c0674a
      Fabio M. De Francesco authored
      The use of kmap() is being deprecated in favor of kmap_local_page() where
      it is feasible. With kmap_local_page(), the mapping is per thread, CPU
      local and not globally visible.
      
      Therefore, use kmap_local_page() / kunmap_local() in lzo.c wherever the
      mappings are per thread and not globally visible.
      
      Tested on a QEMU + KVM 32-bit VM with 4GB of RAM and HIGHMEM64G enabled.
      Suggested-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace kmap() with kmap_local_page() in inode.c · 70826b6b
      Fabio M. De Francesco authored
      The use of kmap() is being deprecated in favor of kmap_local_page() where
      it is feasible. With kmap_local_page(), the mapping is per thread, CPU
      local and not globally visible.
      
      Therefore, use kmap_local_page() / kunmap_local() in inode.c wherever the
      mappings are per thread and not globally visible.
      
      Tested on a QEMU + KVM 32-bit VM with 4GB of RAM and HIGHMEM64G enabled.
      Suggested-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not allocate a btrfs_bio for low-level bios · 9ff7ddd3
      Christoph Hellwig authored
      The bios submitted from btrfs_map_bio don't really interact with the
      rest of btrfs, and the only btrfs_bio member actually used in the
      low-level bios is the pointer to the btrfs_io_context used for the endio
      handler.
      
      Use a union in struct btrfs_io_stripe that allows the endio handler to
      find the btrfs_io_context and remove the spurious ->device assignment
      so that a plain fs_bio_set bio can be used for the low-level bios
      allocated inside btrfs_map_bio.
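
A rough sketch of the union idea follows. All type, field and function names are hypothetical stand-ins (the text only says that struct btrfs_io_stripe gains a union that lets the endio handler find the btrfs_io_context); the point is that a slot needed only during mapping can be reused once the bio is prepared.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_io_context { int error; };
struct sketch_device { int id; };

struct sketch_io_stripe {
	struct sketch_device *dev;
	union {
		uint64_t physical;		 /* used while mapping the block */
		struct sketch_io_context *bioc;	 /* used by the endio handler */
	};
};

/*
 * Consume the physical address (returned as a 512-byte sector number),
 * then reuse the same storage for the back-pointer to the I/O context.
 */
static uint64_t sketch_prepare_stripe(struct sketch_io_stripe *stripe,
				      struct sketch_io_context *bioc)
{
	uint64_t sector = stripe->physical >> 9;

	stripe->bioc = bioc;
	return sector;
}

/* Endio side: the stripe alone is enough to reach the I/O context. */
static struct sketch_io_context *
sketch_stripe_to_bioc(const struct sketch_io_stripe *stripe)
{
	return stripe->bioc;
}
```

Because the stripe itself carries the back-pointer, the bio's private data can simply point at the stripe, and no per-bio wrapper structure is needed for this purpose.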
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: factor stripe submission logic out of btrfs_map_bio · a316a259
      Christoph Hellwig authored
      Move all per-stripe handling into submit_stripe_bio and use a label to
      clean up instead of duplicating the logic.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove btrfs_end_io_wq · d7b9416f
      Christoph Hellwig authored
      All read bios that go through btrfs_map_bio need to be completed in
      user context.  And read I/Os are the most common and timing critical
      in almost any file system workload.

      Embed a work_struct into struct btrfs_bio and use it to complete all
      read bios submitted through btrfs_map_bio, using the REQ_META flag to
      decide which workqueue they are placed on.
      
      This removes the need for a separate 128 byte allocation (typically
      rounded up to 192 bytes by slab) for all reads with a size increase
      of 24 bytes for struct btrfs_bio.  Future patches will reorganize
      struct btrfs_bio to make use of this extra space for writes as well.
      
      (All sizes are based on a typical 64-bit non-debug build)
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: centralize setting REQ_META · 08a6f464
      Christoph Hellwig authored
      Set REQ_META in btrfs_submit_metadata_bio instead of the various callers.
      We'll start relying on this flag inside of btrfs in a bit, and this
      ensures it is always set correctly.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't use btrfs_bio_wq_end_io for compressed writes · fed8a72d
      Christoph Hellwig authored
      Compressed write bio completion is the only user of btrfs_bio_wq_end_io
      for writes, and the use of btrfs_bio_wq_end_io is a little suboptimal
      here as we only really need user context for the final completion of a
      compressed_bio structure, and not for every single bio completion.

      Add a work_struct to struct compressed_bio instead and use that to call
      finish_compressed_bio_write.  This allows removing all handling of
      write bios in the btrfs_bio_wq_end_io infrastructure.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't double-defer bio completions for compressed reads · 02bb5b72
      Christoph Hellwig authored
      The bio completion handler of the bio used for the compressed data is
      already run in a workqueue using btrfs_bio_wq_end_io, so don't schedule
      the completion of the original bio to the same workqueue again but just
      execute it directly.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defer I/O completion based on the btrfs_raid_bio · d34e123d
      Christoph Hellwig authored
      Instead of attaching an extra allocation and an indirect call to each
      low-level bio issued by the RAID code, add a work_struct to struct
      btrfs_raid_bio and only defer the per-rbio completion action.  The
      per-bio action for all the I/Os is trivial and can be safely done
      from interrupt context.

      As a nice side effect this also allows sharing the boilerplate code
      for the per-bio completions.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: split btrfs_submit_data_bio to read and write parts · c93104e7
      Christoph Hellwig authored
      Split btrfs_submit_data_bio into one helper for reads and one for writes.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify code flow in btrfs_submit_dio_bio · e6484bd4
      Christoph Hellwig authored
      There is no exit block or cleanup and the function is reasonably short,
      so we can return inline instead of using a goto. This makes the function
      more straightforward.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move more work into btrfs_end_bioc · b4c46bde
      Christoph Hellwig authored
      Assign ->mirror_num and ->bi_status in btrfs_end_bioc instead of
      duplicating the logic in the callers.  Also remove the bio argument as
      it always must be bioc->orig_bio and the now pointless bioc_error that
      did nothing but assign bi_sector to the same value just sampled in the
      caller.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: enable support for stream v2 and compressed writes · d6815592
      Omar Sandoval authored
      Now that the new support is implemented, allow the ioctl to accept v2
      and the compressed flag, and update the version in sysfs.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: send compressed extents with encoded writes · 3ea4dc5b
      Omar Sandoval authored
      Now that all of the pieces are in place, we can use the ENCODED_WRITE
      command to send compressed extents when appropriate.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: get send buffer pages for protocol v2 · a4b333f2
      Omar Sandoval authored
      For encoded writes in send v2, we will get the encoded data with
      btrfs_encoded_read_regular_fill_pages(), which expects a list of raw
      pages. To avoid extra buffers and copies, we should read directly into
      the send buffer. Therefore, we need the raw pages for the send buffer.
      
      We currently allocate the send buffer with kvmalloc(), which may return
      a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the
      pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page().
      However, the buffer size we use (144K) is not a power of two, so in
      theory kmalloc is not guaranteed to return a page-aligned buffer, and in
      practice it would waste a lot of memory due to rounding up to the next
      power of two. 144K is large enough that it usually gets allocated with
      vmalloc() anyway. So, for send v2, replace kvmalloc() with vmalloc()
      and save the pages in an array.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: write larger chunks when using stream v2 · 356bbbb6
      Omar Sandoval authored
      The length field of the send stream TLV header is 16 bits. This means
      that the maximum amount of data that can be sent for one write is 64K
      minus one. However, encoded writes must be able to send the maximum
      compressed extent (128K) in one command, or more. To support this, send
      stream version 2 encodes the DATA attribute differently: it has no
      length field, and the length is implicitly up to the end of the
      containing command (which has a 32-bit length field). Although this is
      necessary for encoded writes, normal writes can benefit from it, too.
      
      Also add a check to enforce that the DATA attribute is last. It is only
      strictly necessary for v2, but we might as well make v1 consistent with
      it.
      
      For v2, let's bump up the send buffer to the maximum compressed extent
      size plus 16K for the other metadata (144K total). Since this will most
      likely be vmalloc'd (and always will be after the next commit), we round
      it up to the next page since we might as well use the rest of the page
      on systems with >16K pages.
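
The sizes mentioned above can be checked with a little arithmetic. The following is a sketch with hypothetical helper names; only the numbers (16-bit length field, 128K maximum compressed extent, 16K of extra room, rounding up to the page size) come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SZ_16K	(16u * 1024)
#define SKETCH_SZ_128K	(128u * 1024)	/* maximum compressed extent size */

/* Largest payload that a 16-bit TLV length field can describe. */
static uint32_t sketch_tlv16_max_len(void)
{
	return UINT16_MAX;	/* 64K minus one */
}

/* Round x up to a multiple of page_size, which must be a power of two. */
static size_t sketch_round_up(size_t x, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (x + page_size - 1) & ~(page_size - 1);
}

/* v2 send buffer: 128K + 16K = 144K, rounded up to a whole number of pages. */
static size_t sketch_send_v2_buf_size(size_t page_size)
{
	return sketch_round_up(SKETCH_SZ_128K + SKETCH_SZ_16K, page_size);
}
```

With 4K or 16K pages the result stays at 144K (147456 bytes); with 64K pages it grows to 192K, which is the case the rounding is meant for.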
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: add stream v2 definitions · b7c14f23
      Omar Sandoval authored
      This adds the definitions of the new commands for send stream version 2
      and their respective attributes: fallocate, FS_IOC_SETFLAGS (a.k.a.
      chattr), and encoded writes. It also documents two changes to the send
      stream format in v2: the receiver shouldn't assume a maximum command
      size, and the DATA attribute is encoded differently to allow for writes
      larger than 64k. These will be implemented in subsequent changes, and
      then the ioctl will accept the new version and flag.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: explicitly number commands and attributes · 54cab6af
      Omar Sandoval authored
      Commit e77fbf99 ("btrfs: send: prepare for v2 protocol") added
      __BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the
      version plus 1, but as written this creates gaps in the number space.
      
      The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is
      accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2
      has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that
      23 and 24 are valid commands.
      
      Instead, let's explicitly number all of the commands, attributes, and
      sentinel MAX constants.
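
The gap can be reproduced with two small enums. The identifiers below are hypothetical stand-ins rather than the kernel's names, and the explicit variant is only one plausible way to number the sentinels; the values 22 to 25 are the ones quoted above.

```c
#include <assert.h>

/* Implicit numbering: every sentinel silently consumes the next value. */
enum sketch_implicit {
	SKETCH_IMPLICIT_LAST_V1 = 22,	/* last real v1 command */
	SKETCH_IMPLICIT_MAX_V1,		/* becomes 23 */
	/* no v2 commands defined yet */
	SKETCH_IMPLICIT_MAX_V2,		/* becomes 24, implying a command 23 */
	SKETCH_IMPLICIT_MAX,		/* becomes 25, implying 23 and 24 */
};

/* Explicit numbering: sentinels state exactly which value they stand for. */
enum sketch_explicit {
	SKETCH_EXPLICIT_LAST_V1 = 22,
	SKETCH_EXPLICIT_MAX_V1	= 22,
	SKETCH_EXPLICIT_MAX_V2	= 22,	/* no v2 commands defined yet */
	SKETCH_EXPLICIT_MAX	= 22,
};
```

With explicit values, adding a v2 command later means writing its number down and bumping the affected sentinels by hand, so the number space cannot drift.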
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: remove unused send_ctx::{total,cmd}_send_size · ca182acc
      Omar Sandoval authored
      We collect these statistics but have never exposed them in any way. I
      also didn't find any patches that ever attempted to make use of them.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: add force_chunk_alloc trigger to force allocation · 22c55e3b
      Stefan Roesch authored
      Add a write-only trigger to force new chunk allocation for a given block
      group type. It is at
      
        /sys/fs/btrfs/<uuid>/allocation/<type>/force_chunk_alloc
      
      Note: this is now only for debugging and testing and is enabled with the
            CONFIG_BTRFS_DEBUG configuration option. The transaction is
            started from sysfs context and can be problematic in some cases.
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ Changes from the original submission:
        - update changelog
        - drop unnecessary error messages
        - switch value to bool and use kstrtobool
        - move BTRFS_ATTR_W definition
        - add comment for using transaction
      ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: export chunk size in space infos · 19fc516a
      Stefan Roesch authored
      Add a new sysfs knob

        /sys/fs/btrfs/<uuid>/allocation/<type>/chunk_size

      This allows querying and setting the chunk size.
      
      Constraints:
      
      - can be changed by root only
      - system chunk size can't be set
      - maximum chunk size is 10% of the filesystem size
      - final value is rounded down to a multiple of 256M
      - cannot be set on zoned filesystem
      
      Note that rounding and the 10% clamp will result in a different value
      on filesystems smaller than 10G, typically 768M.
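
The interaction of the 10% clamp and the 256M rounding can be worked through in a short sketch. The function name and the order of operations (clamp first, then round down) are assumptions made for illustration; the constraints themselves are the ones listed above.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SZ_1M	(1024ULL * 1024)
#define SKETCH_SZ_256M	(256 * SKETCH_SZ_1M)

/*
 * Clamp the requested chunk size to 10% of the filesystem size, then
 * round the result down to a multiple of 256M.
 */
static uint64_t sketch_effective_chunk_size(uint64_t requested,
					    uint64_t fs_size)
{
	uint64_t limit = fs_size / 10;
	uint64_t size = requested < limit ? requested : limit;

	return size - (size % SKETCH_SZ_256M);
}
```

For example, asking for 1G on an 8G filesystem is clamped to roughly 819M and then rounded down to 768M, whereas the same request on a 100G filesystem is kept at 1G.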
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ Changes to original submission:
        - document setting constraints
        - drop read-only requirement
        - drop unnecessary error messages
        - fix return values of _store callback
        - use memparse for the value
        - fix rounding down to 256M
      ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: store chunk size in space-info struct · f6fca391
      Stefan Roesch authored
      The chunk size is stored in the btrfs_space_info structure.  It is
      initialized at the start and is then used.
      
      A new API is added to update the current chunk size.  This API is used
      to be able to expose the chunk_size as a sysfs setting.
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ rename and merge helpers, switch atomic type to u64, style fixes ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not batch insert non-consecutive dir indexes during log replay · 71b68e9e
      Josef Bacik authored
      While running generic/475 in a loop I got the following error:
      
      BTRFS critical (device dm-11): corrupt leaf: root=5 block=31096832 slot=69, bad key order, prev (263 96 531) current (263 96 524)
      <snip>
       item 65 key (263 96 517) itemoff 14132 itemsize 33
       item 66 key (263 96 523) itemoff 14099 itemsize 33
       item 67 key (263 96 525) itemoff 14066 itemsize 33
       item 68 key (263 96 531) itemoff 14033 itemsize 33
       item 69 key (263 96 524) itemoff 14000 itemsize 33
      
      As you can see here we have 3 dir index keys with the dir index value of
      523, 524, and 525 inserted between 517 and 524.  This occurs because our
      dir index insertion code will bulk insert all dir index items on the
      node regardless of their actual key value.
      
      This makes sense on a normally running system, because if there's a gap
      in between the items there was a deletion before the item was inserted,
      so there's not going to be an overlap of the dir index items that need
      to be inserted and what exists on disk.
      
      However during log replay this isn't necessarily true, we could have any
      number of dir indexes in the tree already.
      
      Fix this by seeing if we're replaying the log, and if we are simply skip
      batching if there's a gap in the key space.
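
The batching rule can be illustrated with a simplified sketch. The function below is hypothetical and ignores everything else the real insertion code has to consider (such as how many items fit in a leaf); it only shows a batch being cut at the first gap in the index sequence when replaying the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Given a sorted list of pending dir index values, return how many
 * leading items may be inserted as a single batch. During log replay
 * the batch stops at the first gap in the index sequence.
 */
static size_t sketch_batch_len(const uint64_t *idx, size_t n, bool log_replay)
{
	size_t len = n ? 1 : 0;

	while (len < n) {
		if (log_replay && idx[len] != idx[len - 1] + 1)
			break;
		len++;
	}
	return len;
}
```

For the pending indexes 523, 525 and 531 this yields one batch of three on a normally running system, but a batch of one during log replay, so an index such as 524 that already exists in the tree is not jumped over.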
      
      This file system was left broken from the fstest, I tested this patch
      against the broken fs to make sure it replayed the log properly, and
      then btrfs checked the file system after the log replay to verify
      everything was ok.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reduce amount of reserved metadata for delayed item insertion · 763748b2
      Filipe Manana authored
      Whenever we want to create a new dir index item (when creating an inode,
      creating a hard link, renaming a file) we reserve 1 unit of metadata space
      for it in a transaction (that's 256K for a node/leaf size of 16K), and
      then create a delayed insertion item for it to be added later to the
      subvolume's tree. That unit of metadata is kept until the delayed item
      is inserted into the subvolume tree, which may take a while to happen
      (in the worst case, it's done only when the transaction commits). If we
      have multiple dir index items to insert for the same directory, say N
      index items, and they all fit in a single leaf of metadata, then we are
      holding N units of reserved metadata space when all we need is 1 unit.
      
      This change addresses that: whenever a new delayed dir index item is
      added, we release the unit of metadata the caller has reserved when it
      started the transaction if adding that new dir index item does not
      result in touching one more metadata leaf, otherwise the reservation
      is kept by transferring it from the transaction block reserve to the
      delayed items block reserve, just like before. Given that with a leaf
      size of 16K we can have a few hundred dir index items in a single leaf
      (the exact value depends on file name lengths), this reduces pressure on
      metadata reservation by releasing unnecessary space much sooner.
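
The size of one reservation unit can be reproduced with a one-line formula. The sketch assumes a unit is nodesize times the maximum tree depth (taken as 8 here) times 2, which matches the 256K figure quoted above for 16K nodes; the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_LEVEL 8	/* assumed maximum b-tree depth */

/* Bytes reserved for inserting num_items items, one unit per item. */
static uint64_t sketch_insert_metadata_size(uint64_t nodesize,
					    unsigned int num_items)
{
	return nodesize * SKETCH_MAX_LEVEL * 2 * num_items;
}
```

Under that assumption, 300 dir index items that all land in one 16K leaf used to pin 300 units (about 75M) until they were flushed, while a single unit of 256K is enough after this change.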
      
      The following fs_mark test showed some improvement when creating many
      files in parallel on a machine with 12 cores running a non-debug kernel
      (Debian's default kernel config):
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        FILES=100000
        THREADS=$(nproc --all)
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        OPTS="-S 0 -L 10 -n $FILES -s 0 -t $THREADS -k"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
      Before:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           2      1200000            0     225991.3          5465891
           4      2400000            0     345728.1          5512106
           4      3600000            0     346959.5          5557653
           8      4800000            0     329643.0          5587548
           8      6000000            0     312657.4          5606717
           8      7200000            0     281707.5          5727985
          12      8400000            0      88309.8          5020422
          12      9600000            0      85835.9          5207496
          16     10800000            0      81039.2          5404964
          16     12000000            0      58548.6          5842468
      
      After:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           2      1200000            0     230604.5          5778375
           4      2400000            0     348908.3          5508072
           4      3600000            0     357028.7          5484337
           6      4800000            0     342898.3          5565703
           6      6000000            0     314670.8          5751555
           8      7200000            0     282548.2          5778177
          12      8400000            0      90844.9          5306819
          12      9600000            0      86963.1          5304689
          16     10800000            0      89113.2          5455248
          16     12000000            0      86693.5          5518933
      
      The "after" results are after applying this patch and all the other
      patches in the same patchset, which is comprised of the following
      changes:
      
        btrfs: balance btree dirty pages and delayed items after a rename
        btrfs: free the path earlier when creating a new inode
        btrfs: balance btree dirty pages and delayed items after clone and dedupe
        btrfs: add assertions when deleting batches of delayed items
        btrfs: deal with deletion errors when deleting delayed items
        btrfs: refactor the delayed item deletion entry point
        btrfs: improve batch deletion of delayed dir index items
        btrfs: assert that delayed item is a dir index item when adding it
        btrfs: improve batch insertion of delayed dir index items
        btrfs: do not BUG_ON() on failure to reserve metadata for delayed item
        btrfs: set delayed item type when initializing it
        btrfs: reduce amount of reserved metadata for delayed item insertion
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      763748b2
    • Filipe Manana's avatar
      btrfs: set delayed item type when initializing it · c9d02ab4
      Filipe Manana authored
      Currently we set the type of a delayed item only after successfully
      inserting it into its respective rbtree. This is fine, as the type
      is not used anywhere before that point, but for the next patch in the
      series, there will be the need to check the type of a delayed item
      before inserting it into a rbtree.
      
      So set the type of a delayed item immediately after allocating it.
      This also makes the trivial wrappers for adding insertion and deletion
      items useless, so remove them as well.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c9d02ab4
    • Filipe Manana's avatar
      btrfs: do not BUG_ON() on failure to reserve metadata for delayed item · 3bae13e9
      Filipe Manana authored
      At btrfs_insert_delayed_dir_index(), we don't expect the metadata
      reservation for the delayed dir index item insertion to fail, because the
      caller is supposed to have reserved 1 unit of metadata space for that.
      All callers are able to deal with an error in case that happens, so there
      is no need for something as drastic as a BUG_ON() in case of failure.
      Instead just emit a warning, so that it's easily noticed during
      development (fstests in particular), and return the error to the caller.
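
      The difference between the two behaviours can be sketched in user space.
      This is only a toy model under stated assumptions: reserve_metadata() and
      both insert functions are hypothetical stand-ins, the exception models a
      crash, and -ENOSPC is an assumed error code.

```python
import errno
import warnings


def reserve_metadata(available_units: int, needed_units: int) -> int:
    """Hypothetical reservation: 0 on success, a negative errno on failure."""
    return 0 if available_units >= needed_units else -errno.ENOSPC


def insert_dir_index_old(available_units: int) -> int:
    ret = reserve_metadata(available_units, 1)
    if ret:
        # Models BUG_ON(ret): an unexpected failure takes everything down.
        raise RuntimeError("BUG: metadata reservation failed")
    return 0


def insert_dir_index_new(available_units: int) -> int:
    ret = reserve_metadata(available_units, 1)
    if ret:
        # Models emitting a warning: visible in test runs, but recoverable.
        warnings.warn(f"metadata reservation failed: {ret}")
        return ret
    return 0
```

      With the second variant the caller decides what to do with the error,
      which is what the commit message says all callers can already handle.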
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3bae13e9
    • Filipe Manana's avatar
      btrfs: improve batch insertion of delayed dir index items · 06ac264f
      Filipe Manana authored
      Currently we group delayed dir index items for insertion as a single batch
      (a single btree operation) as long as their keys are sequential in the key
      space.
      
      For example, if we have delayed index items for the following index keys:

         10, 11, 12, 15, 16, 20, 21

      we end up building three batches:
      
      1) First one for index keys 10, 11 and 12;
      2) Second one for index keys 15 and 16;
      3) Third one for index keys 20 and 21.
      
      However, since the dir index numbers come from a monotonically increasing
      counter and are never reused, we could group all these items into a single
      batch. Holes in the sequence exist only when we had delayed dir index
      items for insertion that got deleted before they were flushed to the
      subvolume's tree.
      
      The delayed items are stored in a rbtree based on their key order, so
      we can just group items into a batch as long as they all fit in a leaf,
      and ignore if there's a gap (key offset, index number) between two
      consecutive items. This is more efficient and reduces the amount of
      time spent when running delayed items if there are gaps between dir
      index items.
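
      The two grouping rules can be compared with a small user-space sketch.
      It is not the kernel implementation: the function names are illustrative,
      and "fits in a leaf" is approximated by a hypothetical item count limit
      rather than by item sizes in bytes.

```python
def batches_sequential_only(keys):
    """Old rule: a batch ends as soon as the next key is not previous + 1."""
    batches = []
    for key in keys:
        if batches and key == batches[-1][-1] + 1:
            batches[-1].append(key)
        else:
            batches.append([key])
    return batches


def batches_ignoring_gaps(keys, max_items_per_leaf):
    """New rule: keep adding sorted keys until the batch would not fit in a leaf."""
    batches = []
    for key in keys:
        if batches and len(batches[-1]) < max_items_per_leaf:
            batches[-1].append(key)
        else:
            batches.append([key])
    return batches


keys = [10, 11, 12, 15, 16, 20, 21]  # the example index keys from the text
old = batches_sequential_only(keys)
# 100 items per leaf is an arbitrary assumption for this illustration.
new = batches_ignoring_gaps(keys, max_items_per_leaf=100)
print(len(old), len(new))  # prints: 3 1
```

      Each batch corresponds to one btree operation, so fewer batches means
      fewer tree searches and less time holding tree locks.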
      
      For example running the following test script:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        NUM_FILES=100
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        # Now delete every other file, to create gaps in the dir index keys.
        for ((i = 1; i <= $NUM_FILES; i += 2)); do
            rm -f $MNT/testdir/file_$i
        done
      
        start=$(date +%s%N)
        sync
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo -e "\nsync took $dur milliseconds"
      
        umount $MNT
      
      While having the following bpftrace script running in another shell:
      
        $ cat bpf-delayed-items-inserts.sh
        #!/usr/bin/bpftrace
      
        /* Must add 'noinline' to btrfs_insert_delayed_items(). */
        k:btrfs_insert_delayed_items
        {
            @start_insert_delayed_items[tid] = nsecs;
        }
      
        k:btrfs_insert_empty_items
        /@start_insert_delayed_items[tid]/
        {
            @insert_batches = count();
        }
      
        kr:btrfs_insert_delayed_items
        /@start_insert_delayed_items[tid]/
        {
            $dur = (nsecs - @start_insert_delayed_items[tid]) / 1000;
            @btrfs_insert_delayed_items_total_time = sum($dur);
            delete(@start_insert_delayed_items[tid]);
        }
      
      Before this change:
      
      @btrfs_insert_delayed_items_total_time: 576
      @insert_batches: 51
      
      After this change:
      
      @btrfs_insert_delayed_items_total_time: 174
      @insert_batches: 2
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      06ac264f
    • Filipe Manana's avatar
      btrfs: assert that delayed item is a dir index item when adding it · a176affe
      Filipe Manana authored
      All delayed items are for dir index items; we don't support any other
      item types at the moment. So simplify __btrfs_add_delayed_item() and add
      an assertion for checking the item's key type. This also allows the next
      change to be simpler and avoid checking key types. In case we add support
      for different item types in the future, we'll hit the assertion
      during development and be able to adjust any code that assumes delayed
      items are always associated with dir index items.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a176affe
    • Filipe Manana's avatar
      btrfs: improve batch deletion of delayed dir index items · 4bd02d90
      Filipe Manana authored
      Currently we group delayed dir index items for deletion in a single batch
      (single btree operation) as long as they all exist in the same leaf and as
      long as their keys are sequential in the key space. For example if we have
      a leaf that has dir index items with offsets:
      
          2, 3, 4, 6, 7, 10
      
      And we have delayed dir index items for deleting all these indexes, and
      no delayed items for any other index keys in between, then we end up
      deleting in 3 batches:
      
      1) First batch for indexes 2, 3 and 4;
      2) Second batch for indexes 6 and 7;
      3) Third batch for index 10.
      
      This is a waste because we can delete all the index keys in a single
      batch. What matters is that each consecutive delayed index key matches
      each consecutive dir index key in a leaf.
      
      So update the logic at btrfs_batch_delete_items() to check only for a
      key match between delayed dir index items and dir index items in a leaf.
      Also skip the useless first iteration that compares the key of the
      first slot to delete with the key of the first delayed item. They
      always match, because the delayed item's key was used for the btree
      search that gave us the path we have.
      
      This is more efficient and reduces runtime of running delayed items, as
      well as lock contention on the subvolume's tree.
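
      A toy model of the two matching rules, counting how many batches are
      needed to delete a set of delayed keys from one leaf. It assumes a single
      leaf, that every delayed key is present in it, and uses plain sorted
      lists; count_delete_batches is an illustrative name, not a kernel function.

```python
def count_delete_batches(leaf, delayed, require_sequential):
    """Count batches needed to delete the delayed keys from a single leaf."""
    leaf = list(leaf)
    pending = list(delayed)
    batches = 0
    while pending:
        slot = leaf.index(pending[0])  # models the btree search for the first key
        n = 1
        # Extend the batch while consecutive leaf keys match consecutive
        # delayed keys (and, under the old rule, are also sequential).
        while (
            n < len(pending)
            and slot + n < len(leaf)
            and leaf[slot + n] == pending[n]
            and (not require_sequential or pending[n] == pending[n - 1] + 1)
        ):
            n += 1
        del leaf[slot:slot + n]
        del pending[:n]
        batches += 1
    return batches


offsets = [2, 3, 4, 6, 7, 10]  # the example leaf from the text
print(count_delete_batches(offsets, offsets, require_sequential=True))
print(count_delete_batches(offsets, offsets, require_sequential=False))
```

      A batch still ends when the leaf holds an item that has no matching
      delayed item, for example leaf keys 2, 3, 5, 6 with delayed keys 2, 3, 6.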
      
      For example, the following test script:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        NUM_FILES=1000
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        # Now delete every other file, to create gaps in the dir index keys.
        for ((i = 1; i <= $NUM_FILES; i += 2)); do
            rm -f $MNT/testdir/file_$i
        done
      
        # Sync to force any delayed items to be flushed to the tree.
        sync
      
        start=$(date +%s%N)
        rm -fr $MNT/testdir
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo -e "\nrm -fr took $dur milliseconds"
      
        umount $MNT
      
      Running that test script while having the following bpftrace script
      running in another shell:
      
        $ cat bpf-measure.sh
        #!/usr/bin/bpftrace
      
        /* Add 'noinline' to btrfs_delete_delayed_items()'s definition. */
        k:btrfs_delete_delayed_items
        {
            @start_delete_delayed_items[tid] = nsecs;
        }
      
        k:btrfs_del_items
        /@start_delete_delayed_items[tid]/
        {
            @delete_batches = count();
        }
      
        kr:btrfs_delete_delayed_items
        /@start_delete_delayed_items[tid]/
        {
            $dur = (nsecs - @start_delete_delayed_items[tid]) / 1000;
            @btrfs_delete_delayed_items_total_time = sum($dur);
            delete(@start_delete_delayed_items[tid]);
        }
      
      Before this change:
      
      @btrfs_delete_delayed_items_total_time: 9563
      @delete_batches: 1001
      
      After this change:
      
      @btrfs_delete_delayed_items_total_time: 7328
      @delete_batches: 509
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4bd02d90
    • Filipe Manana's avatar
      btrfs: refactor the delayed item deletion entry point · 36baa2c7
      Filipe Manana authored
      The delayed item deletion entry point, btrfs_delete_delayed_items(), is a
      bit convoluted for a few reasons:
      
      1) It's really a loop disguised with labels and goto statements;
      
      2) There's a 'delete_fail' label which isn't only for error cases: we can
         jump to that label even if no error happened, if we simply don't have
         more delayed items to delete;

      3) It keeps track of the current and previous items for no good reason,
         as after getting the next item and releasing the current one, it
         just jumps to the 'again' label to look again for the first delayed
         item;

      4) When a delayed item is not in the tree (because it was already deleted
         before), it releases the item while holding a path locked, which is
         not necessary and adds more contention to the tree, especially taking
         into account that the path came from a deletion search, meaning we have
         write locks for nodes at levels 2, 1 and 0. And releasing the item is
         not computationally trivial (rb tree deletion, a kfree() and some
         trivial things).
      
      So refactor it to use a while loop and add some comments to make it more
      obvious why we can have delayed items without a matching item in the tree,
      as well as why we do not keep the delayed node locked all the time when
      running all its deletion items. This is also a preparation for some
      upcoming work involving delayed items.
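
      The shape of such a loop can be sketched as follows. This is a toy model
      and not the kernel code: the tree is a plain set of keys, the log entries
      only record the order of steps, and the "release the path before
      releasing the item" ordering is an assumption drawn from point 4 above.

```python
def delete_delayed_items(tree_keys, delayed_keys):
    """Process delayed deletions in a plain while loop, one item per pass."""
    tree = set(tree_keys)
    pending = sorted(delayed_keys)
    log = []
    while pending:
        key = pending[0]  # always restart from the first delayed item
        if key not in tree:
            # The item was already deleted from the tree earlier: drop the
            # (modelled) path locks first, then release the delayed item.
            log.append(("release_path", key))
            log.append(("release_item", key))
            pending.pop(0)
            continue
        tree.discard(key)
        pending.pop(0)
        log.append(("delete", key))
    return tree, log


remaining, steps = delete_delayed_items([1, 2, 3], [2, 5])
print(remaining, steps)
```

      The loop ends when no delayed items remain, so no label is needed to
      distinguish "finished" from "failed".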
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      36baa2c7
    • Filipe Manana's avatar
      btrfs: deal with deletion errors when deleting delayed items · 2b1d260d
      Filipe Manana authored
      Currently, btrfs_delete_delayed_items() ignores any errors returned from
      btrfs_batch_delete_items(). This looks fishy but it's not a problem at
      the moment because:
      
      1) Two of the errors returned from btrfs_batch_delete_items() are for
         impossible cases, cases where a delayed item does not match any item
         in the leaf the path points to - btrfs_delete_delayed_items() always
         calls btrfs_batch_delete_items() with a path that points to a leaf
         that contains an item matching a delayed item;
      
      2) btrfs_batch_delete_items() may return an error from btrfs_del_items(),
         in which case it does not release the delayed items of the batch.
      
         At the moment this is harmless because btrfs_del_items() actually is
         always able to delete items, even if it returns an error - when it
         returns an error it's because it ended up with a leaf mostly empty
         (less than 1/3 full) and failed to migrate items from that leaf into
         its neighbour leaves - this is not critical, as all the items were
         deleted, we just left the tree a bit unbalanced, but it's still a
         valid tree and causes no harm, and future operations on the tree will
         eventually balance it.
      
         So even if we get an error from btrfs_del_items(), the delayed items
         will not be released but the next time we run delayed items we will
         find out, at btrfs_delete_delayed_items(), that they are not present
         in the tree anymore and then release them.
      
      This is all a bit subtle, and it could certainly lead to a disaster if
      btrfs_del_items() changes one day and starts returning errors before
      being able to delete all the requested items, in which case we could
      leave the filesystem in an inconsistent state, as we would commit a
      transaction despite a failure to delete items from the tree.
      
      So make btrfs_delete_delayed_items() check for any errors from the call
      to btrfs_batch_delete_items().
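
      A minimal sketch of the change in control flow, with hypothetical names
      throughout: flaky_delete() stands in for a batch deletion that can fail,
      and -EIO is an arbitrary error code chosen for the illustration.

```python
import errno


def run_deletions_ignoring_errors(batches, delete_batch):
    """Old behaviour: the return value of each batch deletion is dropped."""
    for batch in batches:
        delete_batch(batch)
    return 0


def run_deletions_checking_errors(batches, delete_batch):
    """New behaviour: stop at the first failure and hand the error to the caller."""
    for batch in batches:
        ret = delete_batch(batch)
        if ret:
            return ret
    return 0


def flaky_delete(batch):
    """Hypothetical batch deletion that fails for any batch containing key 7."""
    return -errno.EIO if 7 in batch else 0


batches = [[2, 3, 4], [6, 7], [10]]
print(run_deletions_ignoring_errors(batches, flaky_delete))
print(run_deletions_checking_errors(batches, flaky_delete))
```

      In the first variant the caller sees success and would go on to commit;
      in the second it sees the error and can abort instead.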
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2b1d260d
    • Filipe Manana's avatar
      btrfs: add assertions when deleting batches of delayed items · 659192e6
      Filipe Manana authored
      There are a few impossible cases that btrfs_batch_delete_items() tries to
      deal with:
      
      1) Getting a path pointing to a NULL leaf;
      2) The leaf slot is pointing beyond the last item in the leaf;
      3) We can't find a single item to delete.
      
      The first case is impossible because the given path was returned by a
      successful call to btrfs_search_slot(). Replace the BUG_ON() with an
      ASSERT for this.
      
      The second case is impossible because we are always called when a delayed
      item matches an item in the given leaf. So add an ASSERT() for that and
      if that condition is not satisfied, trigger a warning and return an error.
      
      The third case is impossible for exactly the same reason as the second
      case. The given delayed item matches one item in the leaf, so we know
      that our batch always has at least one item. Add an ASSERT() to check
      that, trigger a warning if that expectation fails and return an error.
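
      The three checks can be modelled in user space as below. All names are
      hypothetical, ASSERTIONS_ENABLED stands in for a debug-only build option,
      and the -ENOENT return value is an assumption; the commit message only
      says that an error is returned.

```python
import errno
import warnings

ASSERTIONS_ENABLED = False  # hypothetical stand-in for a debug-only build option


def check(cond: bool, what: str) -> bool:
    """Return True if cond holds; otherwise assert (debug) or warn (production)."""
    if cond:
        return True
    if ASSERTIONS_ENABLED:
        raise AssertionError(what)
    warnings.warn(what)
    return False


def batch_delete(leaf, slot, delayed_keys):
    """Delete the run of leaf keys at slot that matches delayed_keys."""
    # Case 1: a successful search always yields a leaf.
    assert leaf is not None, "path must point to a leaf"
    # Case 2: the slot must point at an existing item.
    if not check(slot < len(leaf), "slot is beyond the last item in the leaf"):
        return -errno.ENOENT
    n = 0
    while (
        slot + n < len(leaf)
        and n < len(delayed_keys)
        and leaf[slot + n] == delayed_keys[n]
    ):
        n += 1
    # Case 3: the batch must contain at least one item.
    if not check(n > 0, "found no item to delete"):
        return -errno.ENOENT
    del leaf[slot:slot + n]
    return 0
```

      With assertions disabled, an "impossible" condition produces a warning
      and an error code instead of stopping the program.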
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      659192e6
    • Filipe Manana's avatar
      btrfs: balance btree dirty pages and delayed items after clone and dedupe · 6fe81a3a
      Filipe Manana authored
      When reflinking extents (clone and deduplication), we need to touch the
      btree of the destination inode's subvolume, as well as potentially
      create a delayed inode for the destination inode (if it was not created
      before). However we are neither balancing the btree dirty pages nor the
      delayed items after such operations, so if we have a task that is doing
      a long series of clone or deduplication operations, it can result in
      accumulation of too many btree dirty pages and delayed items.
      
      So just call btrfs_btree_balance_dirty() after clone and deduplication,
      just like we do for every other system call that results in modifying a
      btree and adding delayed items.
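
      Why balancing after each operation matters can be seen in a toy model.
      Everything here is hypothetical: the Fs class, the one-page-per-operation
      cost and DIRTY_LIMIT are made up for illustration, and resetting the
      counter stands in for writeback.

```python
DIRTY_LIMIT = 8  # hypothetical threshold, in pages


class Fs:
    """Toy filesystem state tracking dirty btree pages."""

    def __init__(self):
        self.dirty = 0
        self.peak = 0

    def touch_btree(self, pages=1):
        """Model an operation that dirties some btree pages."""
        self.dirty += pages
        self.peak = max(self.peak, self.dirty)

    def balance_dirty(self):
        """Model balancing: write back once the threshold is reached."""
        if self.dirty >= DIRTY_LIMIT:
            self.dirty = 0


def clone_many(fs, count, balance):
    """Run count clone operations and return the peak number of dirty pages."""
    for _ in range(count):
        fs.touch_btree()
        if balance:
            fs.balance_dirty()
    return fs.peak


print(clone_many(Fs(), 100, balance=False))
print(clone_many(Fs(), 100, balance=True))
```

      Without the balancing call the peak grows with the number of operations;
      with it, the peak stays at the threshold.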
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6fe81a3a
    • Filipe Manana's avatar
      btrfs: free the path earlier when creating a new inode · 814e7718
      Filipe Manana authored
      When creating an inode, through btrfs_create_new_inode(), we release the
      path we allocated before once we don't need it anymore. But we keep it
      allocated until we return from that function, which is wasteful because
      after we release the path we do several things that can allocate yet
      another path: inheriting properties, setting the xattrs used by ACLs and
      security modules, adding an orphan item (O_TMPFILE case) or adding a
      dir item (for the non-O_TMPFILE case).
      
      So free the path once we don't need it anymore, instead of just
      releasing it. This way we avoid having two paths allocated until we
      return from btrfs_create_new_inode().
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      814e7718