1. 09 Dec, 2020 40 commits
    • Qu Wenruo's avatar
      btrfs: scrub: support subpage data scrub · b29dca44
      Qu Wenruo authored
      Btrfs scrub is more flexible than buffered data write path, as we can
      read an unaligned subpage data into page offset 0.
      
      This ability makes subpage support much easier, we just need to check
      each scrub_page::page_len and ensure we only calculate hash for [0,
      page_len) of a page.
      
      There is a small thing to notice: for subpage case, we still do sector
      by sector scrub.  This means we will submit a read bio for each sector
      to scrub, resulting in the same amount of read bios, just like on the 4K
      page systems.
      
      This behavior can be considered as a good thing, if we want everything
      to be the same as 4K page systems.  But this also means, we're wasting
      the possibility to submit larger bio using 64K page size.  This is
      another problem to consider in the future.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b29dca44
    • Qu Wenruo's avatar
      btrfs: scrub: support subpage tree block scrub · 53f3251d
      Qu Wenruo authored
      To support subpage tree block scrub, scrub_checksum_tree_block() only
      needs to learn 2 new tricks:
      
      - Follow sector size
        Now scrub_page only represents one sector, we need to follow it
        properly.
      
      - Run checksum on all sectors
        Since scrub_page only represents one sector, we need to run checksum
        on all sectors, not only (nodesize >> PAGE_SIZE).
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      53f3251d
    • Qu Wenruo's avatar
      btrfs: scrub: always allocate one full page for one sector for RAID56 · d0a7a9c0
      Qu Wenruo authored
      For scrub_pages() and scrub_pages_for_parity(), we currently allocate
      one scrub_page structure for one page.
      
      This is fine if we only read/write one sector one time.  But for cases
      like scrubbing RAID56, we need to read/write the full stripe, which is
      in 64K size for now.
      
      For subpage size, we will submit the read in just one page, which is
      normally a good thing, but for RAID56 case, it only expects to see one
      sector, not the full stripe in its endio function.
      This could lead to wrong parity checksum for RAID56 on subpage.
      
      To make the existing code work well for subpage case, here we take a
      shortcut by always allocating a full page for one sector.
      
      This should provide the base to make RAID56 work for subpage case.
      
      The cost is pretty obvious now, for one RAID56 stripe now we always need
      16 pages. For support subpage situation (64K page size, 4K sector size),
      this means we need full one megabyte to scrub just one RAID56 stripe.
      
      And for data scrub, each 4K sector will also need one 64K page.
      
      This is mostly just a workaround, the proper fix for this is a much
      larger project, using scrub_block to replace scrub_page, and allow
      scrub_block to handle multi pages, csums, and csum_bitmap to avoid
      allocating one page for each sector.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d0a7a9c0
    • Qu Wenruo's avatar
      btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits · fa485d21
      Qu Wenruo authored
      Btrfs on-disk format chose to use u64 for almost everything, but there
      are a other restrictions that won't let us use more than u32 for things
      like extent length (the maximum length is 128MiB for non-hole extents),
      or stripe length (we have device number limit).
      
      This means if we don't have extra handling to convert u64 to u32, we
      will always have some questionable operations like
      "u32 = u64 >> sectorsize_bits" in the code.
      
      This patch will try to address the problem by reducing the width for the
      following members/parameters:
      
      - scrub_parity::stripe_len
      - @len of scrub_pages()
      - @extent_len of scrub_remap_extent()
      - @len of scrub_parity_mark_sectors_error()
      - @len of scrub_parity_mark_sectors_data()
      - @len of scrub_extent()
      - @len of scrub_pages_for_parity()
      - @len of scrub_extent_for_parity()
      
      For members extracted from on-disk structure, like map->stripe_len, they
      will be kept as is. Since that modification would require on-disk format
      change.
      
      There will be cases like "u32 = u64 - u64" or "u32 = u64", for such call
      sites, extra ASSERT() is added to be extra safe for debug builds.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa485d21
    • Qu Wenruo's avatar
      btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs · 6275193e
      Qu Wenruo authored
      Refactor btrfs_lookup_bio_sums() by:
      
      - Remove the @file_offset parameter
        There are two factors making the @file_offset parameter useless:
      
        * For csum lookup in csum tree, file offset makes no sense
          We only need disk_bytenr, which is unrelated to file_offset
      
        * page_offset (file offset) of each bvec is not contiguous.
          Pages can be added to the same bio as long as their on-disk bytenr
          is contiguous, meaning we could have pages at different file offsets
          in the same bio.
      
        Thus passing file_offset makes no sense any more.
        The only user of file_offset is for data reloc inode, we will use
        a new function, search_file_offset_in_bio(), to handle it.
      
      - Extract the csum tree lookup into search_csum_tree()
        The new function will handle the csum search in csum tree.
        The return value is the same as btrfs_find_ordered_sum(), returning
        the number of found sectors which have checksum.
      
      - Change how we do the main loop
        The only needed info from bio is:
        * the on-disk bytenr
        * the length
      
        After extracting the above info, we can do the search without bio
        at all, which makes the main loop much simpler:
      
      	for (cur_disk_bytenr = orig_disk_bytenr;
      	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
      	     cur_disk_bytenr += count * sectorsize) {
      
      		/* Lookup csum tree */
      		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
      					 search_len, csum_dst);
      		if (!count) {
      			/* Csum hole handling */
      		}
      	}
      
      - Use single variable as the source to calculate all other offsets
        Instead of all different type of variables, we use only one main
        variable, cur_disk_bytenr, which represents the current disk bytenr.
      
        All involved values can be calculated from that variable, and
        all those variable will only be visible in the inner loop.
      
      The above refactoring makes btrfs_lookup_bio_sums() way more robust than
      it used to be, especially related to the file offset lookup.  Now
      file_offset lookup is only related to data reloc inode, otherwise we
      don't need to bother file_offset at all.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6275193e
    • Qu Wenruo's avatar
      btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums · 9e46458a
      Qu Wenruo authored
      The function btrfs_lookup_bio_sums() is only called for read bios.
      While btrfs_find_ordered_sum() is to search ordered extent sums, which
      is only for write path.
      
      This means to read a page we either:
      
      - Submit read bio if it's not uptodate
        This means we only need to search csum tree for checksums.
      
      - The page is already uptodate
        It can be marked uptodate for previous read, or being marked dirty.
        As we always mark page uptodate for dirty page.
        In that case, we don't need to submit read bio at all, thus no need
        to search any checksums.
      
      Remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums().
      And since btrfs_lookup_bio_sums() is the only caller for
      btrfs_find_ordered_sum(), also remove the implementation.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e46458a
    • Qu Wenruo's avatar
      btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors · 884b07d0
      Qu Wenruo authored
      To support sectorsize < PAGE_SIZE case, we need to take extra care of
      extent buffer accessors.
      
      Since sectorsize is smaller than PAGE_SIZE, one page can contain
      multiple tree blocks, we must use eb->start to determine the real offset
      to read/write for extent buffer accessors.
      
      This patch introduces two helpers to do this:
      
      - get_eb_page_index()
        This is to calculate the index to access extent_buffer::pages.
        It's just a simple wrapper around "start >> PAGE_SHIFT".
      
        For sectorsize == PAGE_SIZE case, nothing is changed.
        For sectorsize < PAGE_SIZE case, we always get index as 0, and
        the existing page shift also works.
      
      - get_eb_offset_in_page()
        This is to calculate the offset to access extent_buffer::pages.
        This needs to take extent_buffer::start into consideration.
      
        For sectorsize == PAGE_SIZE case, extent_buffer::start is always
        aligned to PAGE_SIZE, thus adding extent_buffer::start to
        offset_in_page() won't change the result.
        For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
        us the correct offset to access.
      
      This patch will touch the following parts to cover all extent buffer
      accessors:
      
      - BTRFS_SETGET_HEADER_FUNCS()
      - read_extent_buffer()
      - read_extent_buffer_to_user()
      - memcmp_extent_buffer()
      - write_extent_buffer_chunk_tree_uuid()
      - write_extent_buffer_fsid()
      - write_extent_buffer()
      - memzero_extent_buffer()
      - copy_extent_buffer_full()
      - copy_extent_buffer()
      - memcpy_extent_buffer()
      - memmove_extent_buffer()
      - btrfs_get_token_##bits()
      - btrfs_get_##bits()
      - btrfs_set_token_##bits()
      - btrfs_set_##bits()
      - generic_bin_search()
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      884b07d0
    • Qu Wenruo's avatar
      btrfs: update num_extent_pages to support subpage sized extent buffer · 4a3dc938
      Qu Wenruo authored
      For subpage sized extent buffer, we have ensured no extent buffer will
      cross page boundary, thus we would only need one page for any extent
      buffer.
      
      Update function num_extent_pages to handle such case.  Now
      num_extent_pages() returns 1 for subpage sized extent buffer.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4a3dc938
    • Qu Wenruo's avatar
      btrfs: don't allow tree block to cross page boundary for subpage support · 1aaac38c
      Qu Wenruo authored
      As a preparation for subpage sector size support (allowing filesystem
      with sector size smaller than page size to be mounted) if the sector
      size is smaller than page size, we don't allow tree block to be read if
      it crosses 64K(*) boundary.
      
      The 64K is selected because:
      
      - we are only going to support 64K page size for subpage for now
      - 64K is also the maximum supported node size
      
      This ensures that tree blocks are always contained in one page for a
      system with 64K page size, which can greatly simplify the handling.
      
      Otherwise we would have to do complex multi-page handling of tree
      blocks.  Currently there is no way to create such tree blocks.
      
      In kernel we have avoided such tree blocks allocation even on 4K page
      size, as it can lead to RAID56 stripe scrubbing.
      
      While btrfs-progs have fixed its chunk allocator since 2016 for convert,
      and has extra checks to do the same behavior as the kernel.
      
      Just add such graceful checks in case of an ancient filesystem.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1aaac38c
    • Qu Wenruo's avatar
      btrfs: calculate inline extent buffer page size based on page size · deb67895
      Qu Wenruo authored
      Btrfs only support 64K as maximum node size, thus for 4K page system, we
      would have at most 16 pages for one extent buffer.
      
      For a system using 64K page size, we would really have just one page.
      
      While we always use 16 pages for extent_buffer::pages, this means for
      systems using 64K pages, we are wasting memory for 15 page pointers
      which will never be used.
      
      Calculate the array size based on page size and the node size maximum.
      
      - for systems using 4K page size, it will stay 16 pages
      - for systems using 64K page size, it will be 1 page
      
      Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to
      avoid circular inclusion of ctree.h.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      deb67895
    • Qu Wenruo's avatar
      btrfs: factor out btree page submission code to a helper · f91e0d0c
      Qu Wenruo authored
      In btree_write_cache_pages() we have a btree page submission routine
      buried deeply in a nested loop.
      
      This patch will extract that part of code into a helper function,
      submit_eb_page(), to do the same work.
      
      Since submit_eb_page() now can return >0 for successful extent
      buffer submission, remove the "ASSERT(ret <= 0);" line.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f91e0d0c
    • Qu Wenruo's avatar
      btrfs: make btrfs_verify_data_csum follow sector size · f44cf410
      Qu Wenruo authored
      Currently btrfs_verify_data_csum() just passes the whole page to
      check_data_csum(), which is fine since we only support sectorsize ==
      PAGE_SIZE.
      
      To support subpage, we need to properly honor per-sector
      checksum verification, just like what we did in dio read path.
      
      This patch will do the csum verification in a for loop, starts with
      pg_off == start - page_offset(page), with sectorsize increase for
      each loop.
      
      For sectorsize == PAGE_SIZE case, the pg_off will always be 0, and we
      will only loop once.
      
      For subpage case, we do the iterate over each sector and if we found any
      error, we return error.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f44cf410
    • Qu Wenruo's avatar
      btrfs: pass bio_offset to check_data_csum() directly · 7ffd27e3
      Qu Wenruo authored
      Parameter icsum for check_data_csum() is a little hard to understand.
      So is the phy_offset for btrfs_verify_data_csum().
      
      Both parameters are calculated values for csum lookup.
      
      Instead of some calculated value, just pass bio_offset and let the
      final and only user, check_data_csum(), calculate whatever it needs.
      
      Since we are here, also make the bio_offset parameter and some related
      variables to be u32 (unsigned int).
      As bio size is limited by its bi_size, which is unsigned int, and has
      extra size limit check during various bio operations.
      Thus we are ensured that bio_offset won't overflow u32.
      
      Thus for all involved functions, not only rename the parameter from
      @phy_offset to @bio_offset, but also reduce its width to u32, so we
      won't have suspicious "u32 = u64 >> sector_bits;" lines anymore.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7ffd27e3
    • Qu Wenruo's avatar
      btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset · 1941b64b
      Qu Wenruo authored
      The parameter bio_offset of extent_submit_bio_start_t is very confusing.
      If it's really bio_offset (offset to bio), then it should be u32.  But
      in fact, it's only utilized by dio read, and that member is used as file
      offset, which must be u64.
      
      Rename it to dio_file_offset since the only user uses it as file offset,
      and add comment for who is using it.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1941b64b
    • Boris Burkov's avatar
      btrfs: fix lockdep warning when creating free space tree · 8a6a87cd
      Boris Burkov authored
      A lock dependency loop exists between the root tree lock, the extent tree
      lock, and the free space tree lock.
      
      The root tree lock depends on the free space tree lock because
      btrfs_create_tree holds the new tree's lock while adding it to the root
      tree.
      
      The extent tree lock depends on the root tree lock because during
      umount, we write out space cache v1, which writes inodes in the root
      tree, which results in holding the root tree lock while doing a lookup
      in the extent tree.
      
      Finally, the free space tree depends on the extent tree because
      populate_free_space_tree holds a locked path in the extent tree and then
      does a lookup in the free space tree to add the new item.
      
      The simplest of the three to break is the one during tree creation: we
      unlock the leaf before inserting the tree node into the root tree, which
      fixes the lockdep warning.
      
        [30.480136] ======================================================
        [30.480830] WARNING: possible circular locking dependency detected
        [30.481457] 5.9.0-rc8+ #76 Not tainted
        [30.481897] ------------------------------------------------------
        [30.482500] mount/520 is trying to acquire lock:
        [30.483064] ffff9babebe03908 (btrfs-free-space-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
        [30.484054]
      	      but task is already holding lock:
        [30.484637] ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
        [30.485581]
      	      which lock already depends on the new lock.
      
        [30.486397]
      	      the existing dependency chain (in reverse order) is:
        [30.487205]
      	      -> #2 (btrfs-extent-01#2){++++}-{3:3}:
        [30.487825]        down_read_nested+0x43/0x150
        [30.488306]        __btrfs_tree_read_lock+0x39/0x180
        [30.488868]        __btrfs_read_lock_root_node+0x3a/0x50
        [30.489477]        btrfs_search_slot+0x464/0x9b0
        [30.490009]        check_committed_ref+0x59/0x1d0
        [30.490603]        btrfs_cross_ref_exist+0x65/0xb0
        [30.491108]        run_delalloc_nocow+0x405/0x930
        [30.491651]        btrfs_run_delalloc_range+0x60/0x6b0
        [30.492203]        writepage_delalloc+0xd4/0x150
        [30.492688]        __extent_writepage+0x18d/0x3a0
        [30.493199]        extent_write_cache_pages+0x2af/0x450
        [30.493743]        extent_writepages+0x34/0x70
        [30.494231]        do_writepages+0x31/0xd0
        [30.494642]        __filemap_fdatawrite_range+0xad/0xe0
        [30.495194]        btrfs_fdatawrite_range+0x1b/0x50
        [30.495677]        __btrfs_write_out_cache+0x40d/0x460
        [30.496227]        btrfs_write_out_cache+0x8b/0x110
        [30.496716]        btrfs_start_dirty_block_groups+0x211/0x4e0
        [30.497317]        btrfs_commit_transaction+0xc0/0xba0
        [30.497861]        sync_filesystem+0x71/0x90
        [30.498303]        btrfs_remount+0x81/0x433
        [30.498767]        reconfigure_super+0x9f/0x210
        [30.499261]        path_mount+0x9d1/0xa30
        [30.499722]        do_mount+0x55/0x70
        [30.500158]        __x64_sys_mount+0xc4/0xe0
        [30.500616]        do_syscall_64+0x33/0x40
        [30.501091]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [30.501629]
      	      -> #1 (btrfs-root-00){++++}-{3:3}:
        [30.502241]        down_read_nested+0x43/0x150
        [30.502727]        __btrfs_tree_read_lock+0x39/0x180
        [30.503291]        __btrfs_read_lock_root_node+0x3a/0x50
        [30.503903]        btrfs_search_slot+0x464/0x9b0
        [30.504405]        btrfs_insert_empty_items+0x60/0xa0
        [30.504973]        btrfs_insert_item+0x60/0xd0
        [30.505412]        btrfs_create_tree+0x1b6/0x210
        [30.505913]        btrfs_create_free_space_tree+0x54/0x110
        [30.506460]        btrfs_mount_rw+0x15d/0x20f
        [30.506937]        btrfs_remount+0x356/0x433
        [30.507369]        reconfigure_super+0x9f/0x210
        [30.507868]        path_mount+0x9d1/0xa30
        [30.508264]        do_mount+0x55/0x70
        [30.508668]        __x64_sys_mount+0xc4/0xe0
        [30.509186]        do_syscall_64+0x33/0x40
        [30.509652]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [30.510271]
      	      -> #0 (btrfs-free-space-00){++++}-{3:3}:
        [30.510972]        __lock_acquire+0x11ad/0x1b60
        [30.511432]        lock_acquire+0xa2/0x360
        [30.511917]        down_read_nested+0x43/0x150
        [30.512383]        __btrfs_tree_read_lock+0x39/0x180
        [30.512947]        __btrfs_read_lock_root_node+0x3a/0x50
        [30.513455]        btrfs_search_slot+0x464/0x9b0
        [30.513947]        search_free_space_info+0x45/0x90
        [30.514465]        __add_to_free_space_tree+0x92/0x39d
        [30.515010]        btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
        [30.515639]        btrfs_mount_rw+0x15d/0x20f
        [30.516142]        btrfs_remount+0x356/0x433
        [30.516538]        reconfigure_super+0x9f/0x210
        [30.517065]        path_mount+0x9d1/0xa30
        [30.517438]        do_mount+0x55/0x70
        [30.517824]        __x64_sys_mount+0xc4/0xe0
        [30.518293]        do_syscall_64+0x33/0x40
        [30.518776]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [30.519335]
      	      other info that might help us debug this:
      
        [30.520210] Chain exists of:
      		btrfs-free-space-00 --> btrfs-root-00 --> btrfs-extent-01#2
      
        [30.521407]  Possible unsafe locking scenario:
      
        [30.522037]        CPU0                    CPU1
        [30.522456]        ----                    ----
        [30.522941]   lock(btrfs-extent-01#2);
        [30.523311]                                lock(btrfs-root-00);
        [30.523952]                                lock(btrfs-extent-01#2);
        [30.524620]   lock(btrfs-free-space-00);
        [30.525068]
      	       *** DEADLOCK ***
      
        [30.525669] 5 locks held by mount/520:
        [30.526116]  #0: ffff9babebc520e0 (&type->s_umount_key#37){+.+.}-{3:3}, at: path_mount+0x7ef/0xa30
        [30.527056]  #1: ffff9babebc52640 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x3d5/0x5c0
        [30.527960]  #2: ffff9babeae8f2e8 (&cache->free_space_lock#2){+.+.}-{3:3}, at: btrfs_create_free_space_tree.cold.22+0x101/0x45d
        [30.529118]  #3: ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
        [30.530113]  #4: ffff9babebd52eb8 (btrfs-extent-00){++++}-{3:3}, at: btrfs_try_tree_read_lock+0x16/0x100
        [30.531124]
      	      stack backtrace:
        [30.531528] CPU: 0 PID: 520 Comm: mount Not tainted 5.9.0-rc8+ #76
        [30.532166] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module_el8.1.0+248+298dec18 04/01/2014
        [30.533215] Call Trace:
        [30.533452]  dump_stack+0x8d/0xc0
        [30.533797]  check_noncircular+0x13c/0x150
        [30.534233]  __lock_acquire+0x11ad/0x1b60
        [30.534667]  lock_acquire+0xa2/0x360
        [30.535063]  ? __btrfs_tree_read_lock+0x39/0x180
        [30.535525]  down_read_nested+0x43/0x150
        [30.535939]  ? __btrfs_tree_read_lock+0x39/0x180
        [30.536400]  __btrfs_tree_read_lock+0x39/0x180
        [30.536862]  __btrfs_read_lock_root_node+0x3a/0x50
        [30.537304]  btrfs_search_slot+0x464/0x9b0
        [30.537713]  ? trace_hardirqs_on+0x1c/0xf0
        [30.538148]  search_free_space_info+0x45/0x90
        [30.538572]  __add_to_free_space_tree+0x92/0x39d
        [30.539071]  ? printk+0x48/0x4a
        [30.539367]  btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
        [30.539972]  btrfs_mount_rw+0x15d/0x20f
        [30.540350]  btrfs_remount+0x356/0x433
        [30.540773]  ? shrink_dcache_sb+0xd9/0x100
        [30.541203]  reconfigure_super+0x9f/0x210
        [30.541642]  path_mount+0x9d1/0xa30
        [30.542040]  do_mount+0x55/0x70
        [30.542366]  __x64_sys_mount+0xc4/0xe0
        [30.542822]  do_syscall_64+0x33/0x40
        [30.543197]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [30.543691] RIP: 0033:0x7f109f7ab93a
        [30.546042] RSP: 002b:00007ffc47c4f858 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
        [30.546770] RAX: ffffffffffffffda RBX: 00007f109f8cf264 RCX: 00007f109f7ab93a
        [30.547485] RDX: 0000557e6fc10770 RSI: 0000557e6fc19cf0 RDI: 0000557e6fc19cd0
        [30.548185] RBP: 0000557e6fc10520 R08: 0000557e6fc18e30 R09: 0000557e6fc18cb0
        [30.548911] R10: 0000000000200020 R11: 0000000000000246 R12: 0000000000000000
        [30.549606] R13: 0000557e6fc19cd0 R14: 0000557e6fc10770 R15: 0000557e6fc10520
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8a6a87cd
    • Boris Burkov's avatar
      btrfs: skip space_cache v1 setup when not using it · af456a2c
      Boris Burkov authored
      If we are not using space cache v1, we should not create the free space
      object or free space inodes. This comes up when we delete the existing
      free space objects/inodes when migrating to v2, only to see them get
      recreated for every dirtied block group.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      af456a2c
    • Boris Burkov's avatar
      btrfs: remove free space items when disabling space cache v1 · 36b216c8
      Boris Burkov authored
      When the filesystem transitions from space cache v1 to v2 or to
      nospace_cache, it removes the old cached data, but does not remove
      the FREE_SPACE items nor the free space inodes they point to. This
      doesn't cause any issues besides being a bit inefficient, since these
      items no longer do anything useful.
      
      To fix it, when we are mounting, and plan to disable the space cache,
      destroy each block group's free space item and free space inode.
      The code to remove the items is lifted from the existing use case of
      removing the block group, with a light adaptation to handle whether or
      not we have already looked up the free space inode.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      36b216c8
    • Boris Burkov's avatar
      btrfs: warn when remount will not change the free space tree · 2838d255
      Boris Burkov authored
      If the remount is ro->ro, rw->ro, or rw->rw, we will not create or
      clear the free space tree. This can be surprising, so print a warning
      to dmesg to make the failure more visible. It is also important to
      ensure that the space cache options (SPACE_CACHE, FREE_SPACE_TREE) are
      consistent, so ensure those are set to properly match the current on
      disk state (which won't be changing).
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2838d255
    • Boris Burkov's avatar
      btrfs: use superblock state to print space_cache mount option · 04c41559
      Boris Burkov authored
      To make the contents of /proc/mounts better match the actual state of
      the filesystem, base the display of the space cache mount options off
      the contents of the super block rather than the last mount options
      passed in. Since there are many scenarios where the mount will ignore a
      space cache option, simply showing the passed in option is misleading.
      
      For example, if we mount with -o remount,space_cache=v2 on a read-write
      file system without an existing free space tree, we won't build a free
      space tree, but /proc/mounts will read space_cache=v2 (until we mount
      again and it goes away)
      
      cache_generation is set iff space_cache=v1, FREE_SPACE_TREE is set iff
      space_cache=v2, and if neither is the case, we print nospace_cache.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      04c41559
    • Boris Burkov's avatar
      btrfs: keep sb cache_generation consistent with space_cache · 94846229
      Boris Burkov authored
      When mounting, btrfs uses the cache_generation in the super block to
      determine if space cache v1 is in use. However, by mounting with
      nospace_cache or space_cache=v2, it is possible to disable space cache
      v1, which does not result in un-setting cache_generation back to 0.
      
      In order to base some logic, like mount option printing in /proc/mounts,
      on the current state of the space cache rather than just the values of
      the mount option, keep the value of cache_generation consistent with the
      status of space cache v1.
      
      We ensure that cache_generation > 0 iff the file system is using
      space_cache v1. This requires committing a transaction on any mount
      which changes whether we are using v1. (v1->nospace_cache, v1->v2,
      nospace_cache->v1, v2->v1).
      
      Since the mechanism for writing out the cache generation is transaction
      commit, but we want some finer grained control over when we un-set it,
      we can't just rely on the SPACE_CACHE mount option, and introduce an
      fs_info flag that mount can use when it wants to unset the generation.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      94846229
    • Boris Burkov's avatar
      btrfs: clear free space tree on ro->rw remount · 8b228324
      Boris Burkov authored
      A user might want to revert to v1 or nospace_cache on a root filesystem,
      and much like turning on the free space tree, that can only be done
      remounting from ro->rw. Support clearing the free space tree on such
      mounts by moving it into the shared remount logic.
      
      Since the CLEAR_CACHE option sticks around across remounts, this change
      would result in clearing the tree for ever on every remount, which is
      not desirable. To fix that, add CLEAR_CACHE to the oneshot options we
      clear at mount end, which has the other bonus of not cluttering the
      /proc/mounts output with clear_cache.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8b228324
    • Boris Burkov's avatar
      btrfs: clear oneshot options on mount and remount · 8cd29088
      Boris Burkov authored
      Some options only apply during mount time and are cleared at the end
      of mount. For now, the example is USEBACKUPROOT, but CLEAR_CACHE also
      fits the bill, and this is a preparation patch for also clearing that
      option.
      
      One subtlety is that the current code only resets USEBACKUPROOT on rw
      mounts, but the option is meaningfully "consumed" by a ro mount, so it
      feels appropriate to clear in that case as well. A subsequent read-write
      remount would not go through open_ctree, which is the only place that
      checks the option, so the change should be benign.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8cd29088
    • Boris Burkov's avatar
      btrfs: create free space tree on ro->rw remount · 5011139a
      Boris Burkov authored
      When a user attempts to remount a btrfs filesystem with
      'mount -o remount,space_cache=v2', that operation silently succeeds.
      Unfortunately, this is misleading, because the remount does not create
      the free space tree. /proc/mounts will incorrectly show space_cache=v2,
      but on the next mount, the file system will revert to the old
      space_cache.
      
      For now, we handle only the easier case, where the existing mount is
      read-only and the new mount is read-write. In that case, we can create
      the free space tree without contending with the block groups changing
      as we go.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5011139a
    • Boris Burkov's avatar
      btrfs: only mark bg->needs_free_space if free space tree is on · 997e3e2e
      Boris Burkov authored
      If we attempt to create a free space tree while any block groups have
      needs_free_space set, we will double add the new free space item
      and hit EEXIST. Previously, we only created the free space tree on a new
      mount, so we never hit the case, but if we try to create it on a
      remount, such block groups could exist and trip us up.
      
      We don't do anything with this field unless the free space tree is
      enabled, so there is no harm in not setting it.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      997e3e2e
    • Boris Burkov's avatar
      btrfs: start orphan cleanup on ro->rw remount · 8f1c21d7
      Boris Burkov authored
      When we mount a rw filesystem, we start the orphan cleanup process in
      tree root and filesystem tree. However, when we remount a ro file system
      rw, we only clean the former. Move the calls to btrfs_orphan_cleanup()
      on tree_root and fs_root to the shared rw mount routine to effectively
      add them on ro->rw remount.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8f1c21d7
    • Boris Burkov's avatar
      btrfs: lift read-write mount setup from mount and remount · 44c0ca21
      Boris Burkov authored
      Mounting rw and remounting from ro to rw naturally share invariants and
      functionality which result in a correctly setup rw filesystem. Luckily,
      there is even a strong unity in the code which implements them. In
      mount's open_ctree, these operations mostly happen after an early return
      for ro file systems, and in remount, they happen in a section devoted to
      remounting ro->rw, after some remount specific validation passes.
      
      However, there are unfortunately a few differences. There are small
      deviations in the order of some of the operations, remount does not
      start orphan cleanup in root_tree or fs_tree, remount does not create
      the free space tree, and remount does not handle "one-shot" mount
      options like clear_cache and uuid tree rescan.
      
      Since we want to add building the free space tree to remount, and also
      to start the same orphan cleanup process on a filesystem mounted as ro
      then remounted rw, we would benefit from unifying the logic between the
      two code paths.
      
      This patch only lifts the existing common functionality, and leaves a
      natural path for fixing the discrepancies.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      44c0ca21
    • Filipe Manana's avatar
      btrfs: do not block inode logging for so long during transaction commit · 47876f7c
      Filipe Manana authored
      Early on during a transaction commit we acquire the tree_log_mutex and
      hold it until after we write the super blocks. But before writing the
      extent buffers dirtied by the transaction and the super blocks we unblock
      the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
      fs_info->running_transaction to NULL.
      
      This means that after that and before writing the super blocks, new
      transactions can start. However if any transaction wants to log an inode,
      it will block waiting for the transaction commit to write its dirty
      extent buffers and the super blocks because the tree_log_mutex is only
      released after those operations are complete, and starting a new log
      transaction blocks on that mutex (at start_log_trans()).
      
      Writing the dirty extent buffers and the super blocks can take a very
      significant amount of time to complete, but we could allow the tasks
      wanting to log an inode to proceed with most of their steps:
      
      1) create the log trees
      2) log metadata in the trees
      3) write their dirty extent buffers
      
      They only need to wait for the previous transaction commit to complete
      (write its super blocks) before they attempt to write their super blocks,
      otherwise we could end up with a corrupt filesystem after a crash.
      
      So change start_log_trans() to use the root tree's log_mutex to serialize
      for the creation of the log root tree instead of using the tree_log_mutex,
      and make btrfs_sync_log() acquire the tree_log_mutex before writing the
      super blocks. This allows for inode logging to wait much less time when
      there is a previous transaction that is still committing, often not having
      to wait at all, as by the time when we try to sync the log the previous
      transaction already wrote its super blocks.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      The following script that uses dbench was used to measure the impact of
      the whole patchset:
      
        $ cat test-dbench.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd"
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f -m single -d single $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 300 64
      
        umount $MNT
      
      The test was run on a machine with 12 cores, 64G of ram, using a NVMe
      device and a non-debug kernel configuration (Debian's default).
      
      Before patch set:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    11277211    0.250    85.340
       Close        8283172     0.002     6.479
       Rename        477515     1.935    86.026
       Unlink       2277936     0.770    87.071
       Deltree          256    15.732    81.379
       Mkdir            128     0.003     0.009
       Qpathinfo    10221180    0.056    44.404
       Qfileinfo    1789967     0.002     4.066
       Qfsinfo      1874399     0.003     9.176
       Sfileinfo     918589     0.061    10.247
       Find         3951758     0.341    54.040
       WriteX       5616547     0.047    85.079
       ReadX        17676028    0.005     9.704
       LockX          36704     0.003     1.800
       UnlockX        36704     0.002     0.687
       Flush         790541    14.115   676.236
      
      Throughput 1179.19 MB/sec  64 clients  64 procs  max_latency=676.240 ms
      
      After patch set:
      
      Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    12687926    0.171    86.526
       Close        9320780     0.002     8.063
       Rename        537253     1.444    78.576
       Unlink       2561827     0.559    87.228
       Deltree          374    11.499    73.549
       Mkdir            187     0.003     0.005
       Qpathinfo    11500300    0.061    36.801
       Qfileinfo    2017118     0.002     7.189
       Qfsinfo      2108641     0.003     4.825
       Sfileinfo    1033574     0.008     8.065
       Find         4446553     0.408    47.835
       WriteX       6335667     0.045    84.388
       ReadX        19887312    0.003     9.215
       LockX          41312     0.003     1.394
       UnlockX        41312     0.002     1.425
       Flush         889233    13.014   623.259
      
      Throughput 1339.32 MB/sec  64 clients  64 procs  max_latency=623.265 ms
      
      +12.7% throughput, -8.2% max latency
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      47876f7c
    • Filipe Manana's avatar
      btrfs: fix race leading to unnecessary transaction commit when logging inode · 639bd575
      Filipe Manana authored
      When logging an inode we may often have to fallback to a full transaction
      commit, either because a new block group was allocated, there is some case
      we can not deal with without a transaction commit or some error like an
      ENOMEM happened. However after we fallback to a transaction commit, we
      have a time window where we can make the next attempt to log any inode
      commit the next transaction unnecessarily, adding additional overhead and
      increasing latency.
      
      A sequence of steps that leads to this issue is the following:
      
      1) The current open transaction has a generation of 1000;
      
      2) A new block group is allocated, and as a consequence we must make sure
         any attempts to commit a log fallback to a transaction commit, so
         btrfs_set_log_full_commit() is called from btrfs_make_block_group().
         This sets fs_info->last_trans_log_full_commit to 1000;
      
      3) Task A is holding a handle on transaction 1000 and tries to log inode X.
         Once it gets to start_log_trans(), it calls btrfs_need_log_full_commit()
         which returns true, since fs_info->last_trans_log_full_commit has a
         value of 1000. So we end up returning EAGAIN and propagating it up to
         btrfs_sync_file(), where we commit transaction 1000;
      
      4) The transaction commit task (task A) sets the transaction state to
         unblocked (TRANS_STATE_UNBLOCKED);
      
      5) Some other task, task B, starts a new transaction with a generation of
         1001;
      
      6) Some stuff is done with transaction 1001, some btree blocks COWed, etc;
      
      7) Transaction 1000 has not fully committed yet, we are still writing all
         the extent buffers it created;
      
      8) Some new task, task C, starts an fsync of inode Y, gets a handle for
         transaction 1001, and it gets to btrfs_log_inode_parent() which does
         the following check:
      
           if (fs_info->last_trans_log_full_commit > last_committed) {
               ret = 1;
               goto end_no_trans;
           }
      
         At that point last_trans_log_full_commit has a value of 1000 and
         last_committed (value of fs_info->last_trans_committed) has a value of
         999, since transaction 1000 has not yet committed - it is either still
         writing out dirty extent buffers, its super blocks or unpinning
         extents.
      
         As a consequence we return 1, which gets propagated up to
         btrfs_sync_file(), which will then call btrfs_commit_transaction()
         for transaction 1001.
      
         As a consequence we have an unnecessary second transaction commit, we
         previously committed transaction 1000 and now commit transaction 1001
         as well, resulting in more overhead and increased latency.
      
      So fix this double transaction commit issue simply by removing that check,
      because all we need to do is wait for the previous transaction to finish
      its commit, which we already do later when starting the log transaction at
      start_log_trans(), because there we acquire the tree_log_mutex lock, which
      is held by a transaction commit and only released after the transaction
      commits its super blocks.
      
      Another issue that check has is that it reads last_trans_log_full_commit
      without using READ_ONCE(), which is incorrect since that member of
      struct btrfs_fs_info is always updated with WRITE_ONCE() through the
      helper btrfs_set_log_full_commit().
      
      This double transaction commit issue can actually be triggered quite often
      in long runs of dbench, since besides the creation of new block groups
      that force inode logging to fallback to a transaction commit, there are
      cases where dbench asks to fsync a directory which had files in it that
      were previously renamed or subdirectories that were removed, resulting in
      the inode logging to fallback to a full transaction commit.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      639bd575
    • Filipe Manana's avatar
      btrfs: fix race that makes inode logging fallback to transaction commit · 47d3db41
      Filipe Manana authored
      When logging an inode and the previous transaction is still committing, we
      have a time window where we can end up incorrectly think an inode has its
      last_unlink_trans field with a value greater than the last transaction
      committed, which results in the logging to fallback to a full transaction
      commit, which is usually much more expensive than doing a log commit.
      
      The race is described by the following steps:
      
      1) We are at transaction 1000;
      
      2) We modify an inode X (a directory) using transaction 1000 and set its
         last_unlink_trans field to 1000, because for example we removed one
         of its subdirectories;
      
      3) We create a new inode Y with a dentry in inode X using transaction 1000,
         so its generation field is set to 1000;
      
      4) The commit for transaction 1000 is started by task A;
      
      5) The task committing transaction 1000 sets the transaction state to
         unblocked, writes the dirty extent buffers and the super blocks, then
         unlocks tree_log_mutex;
      
      6) Some task starts a new transaction with a generation of 1001;
      
      7) We do some modification to inode Y (using transaction 1001);
      
      8) The transaction 1000 commit starts unpinning extents. At this point
         fs_info->last_trans_committed still has a value of 999;
      
      9) Task B starts an fsync on inode Y, and gets a handle for transaction
         1001. When it gets to check_parent_dirs_for_sync() it does the checking
         of the ancestor dentries because the following check does not evaluate
         to true:
      
             if (S_ISREG(inode->vfs_inode.i_mode) &&
                 inode->generation <= last_committed &&
                 inode->last_unlink_trans <= last_committed)
                     goto out;
      
         The generation value for inode Y is 1000 and last_committed, which has
         the value read from fs_info->last_trans_committed, has a value of 999,
         so that check evaluates to false and we proceed to check the ancestor
         inodes.
      
         Once we get to the first ancestor, inode X, we call
         btrfs_must_commit_transaction() on it, which evaluates to true:
      
         static bool btrfs_must_commit_transaction(...)
         {
             struct btrfs_fs_info *fs_info = inode->root->fs_info;
             bool ret = false;
      
             mutex_lock(&inode->log_mutex);
             if (inode->last_unlink_trans > fs_info->last_trans_committed) {
                 /*
                  * Make sure any commits to the log are forced to be full
                  * commits.
                  */
                  btrfs_set_log_full_commit(trans);
                  ret = true;
             }
          (...)
      
          because inode's X last_unlink_trans has a value of 1000 and
          fs_info->last_trans_committed still has a value of 999, it returns
          true to check_parent_dirs_for_sync(), making it return 1 which is
          propagated up to btrfs_sync_file(), causing it to fallback to a full
          transaction commit of transaction 1001.
      
          We should have not fallen back to commit transaction 1001, since inode
          X had last_unlink_trans set to 1000 and the super blocks for
          transaction 1000 were already written. So while not resulting in a
          functional problem, it leads to a lot more work and higher latencies
          for a fsync since committing a transaction is usually more expensive
          than committing a log (if other filesystem changes happened under that
          transaction).
      
      Similar problem happens when logging directories, for the same reason as
      btrfs_must_commit_transaction() returns true on an inode with its
      last_unlink_trans having the generation of the previous transaction and
      that transaction is still committing, unpinning its freed extents.
      
      So fix this by comparing last_unlink_trans with the id of the current
      transaction instead of fs_info->last_trans_committed.
      
      This case is often hit when running dbench for a long enough duration, as
      it does lots of rename and rmdir operations (both update the field
      last_unlink_trans of an inode) and fsyncs of files and directories.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      47d3db41
    • Filipe Manana's avatar
      btrfs: fix race that causes unnecessary logging of ancestor inodes · 4d6221d7
      Filipe Manana authored
      When logging an inode and we are checking if we need to log ancestors that
      are new, if the previous transaction is still committing we have a time
      window where we can unnecessarily log ancestor inodes that were created in
      the previous transaction.
      
      The race is described by the following steps:
      
      1) We are at transaction 1000;
      
      2) Directory inode X is created, its generation is set to 1000;
      
      3) The commit for transaction 1000 is started by task A;
      
      4) The task committing transaction 1000 sets the transaction state to
         unblocked, writes the dirty extent buffers and the super blocks, then
         unlocks tree_log_mutex;
      
      5) Inode Y, a regular file, is created under directory inode X, this
         results in starting a new transaction with a generation of 1001;
      
      6) The transaction 1000 commit is unpinning extents. At this point
         fs_info->last_trans_committed still has a value of 999;
      
      7) Task B calls fsync on inode Y and gets a handle for transaction 1001;
      
      8) Task B ends up at log_all_new_ancestors() and then because inode Y has
         only one hard link, ends up at log_new_ancestors_fast(). There it reads
         a value of 999 from fs_info->last_trans_committed, and sees that the
         parent inode X has a generation of 1000, so we end up logging inode X:
      
           if (inode->generation > fs_info->last_trans_committed) {
               ret = btrfs_log_inode(trans, root, inode,
                                     LOG_INODE_EXISTS, ctx);
               (...)
      
         which is not necessary since it was created in the past transaction,
         with a generation of 1000, and that transaction has already committed
         its super blocks - it's still unpinning extents so it has not yet
         updated fs_info->last_trans_committed from 999 to 1000.
      
         So this just causes us to spend more time logging and allocating and
         writing more tree blocks for the log tree.
      
      So fix this by comparing an inode's generation with the generation of the
      transaction our transaction handle refers to - if the inode's generation
      matches the generation of the current transaction than we know it is a
      new inode we need to log, otherwise don't log it.
      
      This case is often hit when running dbench for a long enough duration.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4d6221d7
    • Filipe Manana's avatar
      btrfs: fix race that results in logging old extents during a fast fsync · 5f96bfb7
      Filipe Manana authored
      When logging the extents of an inode during a fast fsync, we have a time
      window where we can log extents that are from the previous transaction and
      already persisted. This only makes us waste time unnecessarily.
      
      The following sequence of steps shows how this can happen:
      
      1) We are at transaction 1000;
      
      2) An ordered extent E from inode I completes, that is it has gone through
         btrfs_finish_ordered_io(), and it set the extent maps' generation to
         1000 when we unpin the extent, which is the generation of the current
         transaction;
      
      3) The commit for transaction 1000 starts by task A;
      
      4) The task committing transaction 1000 sets the transaction state to
         unblocked, writes the dirty extent buffers and the super blocks, then
         unlocks tree_log_mutex;
      
      5) Some change is made to inode I, resulting in creation of a new
         transaction with a generation of 1001;
      
      6) The transaction 1000 commit starts unpinning extents. At this point
         fs_info->last_trans_committed still has a value of 999;
      
      7) Task B starts an fsync on inode I, and when it gets to
         btrfs_log_changed_extents() sees the extent map for extent E in the
         list of modified extents. It sees the extent map has a generation of
         1000 and fs_info->last_trans_committed has a value of 999, so it
         proceeds to logging the respective file extent item and all the
         checksums covering its range.
      
         So we end up wasting time since the extent was already persisted and
         is reachable through the trees pointed to by the super block committed
         by transaction 1000.
      
      So just fix this by comparing the extent maps generation against the
      generation of the transaction handle - if it is smaller then the id in the
      handle, we know the extent was already persisted and we do not need to log
      it.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5f96bfb7
    • Filipe Manana's avatar
      btrfs: fix race causing unnecessary inode logging during link and rename · de53d892
      Filipe Manana authored
      When we are doing a rename or a link operation for an inode that was logged
      in the previous transaction and that transaction is still committing, we
      have a time window where we incorrectly consider that the inode was logged
      previously in the current transaction and therefore decide to log it to
      update it in the log. The following steps give an example on how this
      happens during a link operation:
      
      1) Inode X is logged in transaction 1000, so its logged_trans field is set
         to 1000;
      
      2) Task A starts to commit transaction 1000;
      
      3) The state of transaction 1000 is changed to TRANS_STATE_UNBLOCKED;
      
      4) Task B starts a link operation for inode X, and as a consequence it
         starts transaction 1001;
      
      5) Task A is still committing transaction 1000, therefore the value stored
         at fs_info->last_trans_committed is still 999;
      
      6) Task B calls btrfs_log_new_name(), it reads a value of 999 from
         fs_info->last_trans_committed and because the logged_trans field of
         inode X has a value of 1000, the function does not return immediately,
         instead it proceeds to logging the inode, which should not happen
         because the inode was logged in the previous transaction (1000) and
         not in the current one (1001).
      
      This is not a functional problem, just wasted time and space logging an
      inode that does not need to be logged, contributing to higher latency
      for link and rename operations.
      
      So fix this by comparing the inodes' logged_trans field with the
      generation of the current transaction instead of comparing with the value
      stored in fs_info->last_trans_committed.
      
      This case is often hit when running dbench for a long enough duration, as
      it does lots of rename operations.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      de53d892
    • David Sterba's avatar
      btrfs: remove recalc_thresholds from free space ops · fa598b06
      David Sterba authored
      After removing the inode number cache that was using the free space
      cache code, we can remove at least the recalc_thresholds callback from
      the ops. Both code and tests use the same callback function. It's moved
      before its first use.
      
      The use_bitmaps callback is still needed by tests to create some
      extents/bitmap setup.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa598b06
    • Nikolay Borisov's avatar
      btrfs: always set NODATASUM/NODATACOW in __create_free_space_inode · f0d1219d
      Nikolay Borisov authored
      Since it's being used solely for the freespace cache unconditionally
      set the flags required for it.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f0d1219d
    • Nikolay Borisov's avatar
      btrfs: remove crc_check logic from free space · 7dbdb443
      Nikolay Borisov authored
      Following removal of the ino cache io_ctl_init will be called only on
      behalf of the freespace inode. In this case we always want to check
      CRCs so conditional code that depended on io_ctl::check_crc can be
      removed.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7dbdb443
    • Nikolay Borisov's avatar
      btrfs: remove inode number cache feature · 5297199a
      Nikolay Borisov authored
      It's been deprecated since commit b547a88e ("btrfs: start
      deprecation of mount option inode_cache") which enumerates the reasons.
      
      A filesystem that uses the feature (mount -o inode_cache) tracks the
      inode numbers in bitmaps, that data stay on the filesystem after this
      patch. The size is roughly 5MiB for 1M inodes [1], which is considered
      small enough to be left there. Removal of the change can be implemented
      in btrfs-progs if needed.
      
      [1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5297199a
    • Nikolay Borisov's avatar
      btrfs: replace calls to btrfs_find_free_ino with btrfs_find_free_objectid · abadc1fc
      Nikolay Borisov authored
      The former is going away as part of the inode map removal so switch
      callers to btrfs_find_free_objectid. No functional changes since with
      INODE_MAP disabled (default) find_free_objectid was called anyway.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      abadc1fc
    • Nikolay Borisov's avatar
      btrfs: move btrfs_find_highest_objectid/btrfs_find_free_objectid to disk-io.c · ec7d6dfd
      Nikolay Borisov authored
      Those functions are going to be used even after inode cache is removed
      so moved them to a more appropriate place.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ec7d6dfd
    • David Sterba's avatar
      btrfs: drop casts of bio bi_sector · 1201b58b
      David Sterba authored
      Since commit 72deb455 ("block: remove CONFIG_LBDAF") (5.2) the
      sector_t type is u64 on all arches and configs so we don't need to
      typecast it.  It used to be unsigned long and the result of sector size
      shifts were not guaranteed to fit in the type.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1201b58b
    • Naohiro Aota's avatar
      btrfs: implement log-structured superblock for ZONED mode · 12659251
      Naohiro Aota authored
      Superblock (and its copies) is the only data structure in btrfs which
      has a fixed location on a device. Since we cannot overwrite in a
      sequential write required zone, we cannot place superblock in the zone.
      One easy solution is limiting superblock and copies to be placed only in
      conventional zones.  However, this method has two downsides: one is
      reduced number of superblock copies. The location of the second copy of
      superblock is 256GB, which is in a sequential write required zone on
      typical devices in the market today.  So, the number of superblock and
      copies is limited to be two.  Second downside is that we cannot support
      devices which have no conventional zones at all.
      
      To solve these two problems, we employ superblock log writing. It uses
      two adjacent zones as a circular buffer to write updated superblocks.
      Once the first zone is filled up, start writing into the second one.
      Then, when both zones are filled up and before starting to write to the
      first zone again, it reset the first zone.
      
      We can determine the position of the latest superblock by reading write
      pointer information from a device. One corner case is when both zones
      are full. For this situation, we read out the last superblock of each
      zone, and compare them to determine which zone is older.
      
      The following zones are reserved as the circular buffer on ZONED btrfs.
      
      - The primary superblock: zones 0 and 1
      - The first copy: zones 16 and 17
      - The second copy: zones 1024 or zone at 256GB which is minimum, and
        next to it
      
      If these reserved zones are conventional, superblock is written fixed at
      the start of the zone without logging.
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      12659251