1. 22 Jun, 2021 2 commits
    • btrfs: disable build on platforms having page size 256K · b05fbcc3
      Christophe Leroy authored
      With a config having PAGE_SIZE set to 256K, the BTRFS build fails
      with the following message:
      
        include/linux/compiler_types.h:326:38: error: call to
        '__compiletime_assert_791' declared with attribute error:
        BUILD_BUG_ON failed: (BTRFS_MAX_COMPRESSED % PAGE_SIZE) != 0
      
      BTRFS_MAX_COMPRESSED being 128K, BTRFS cannot support platforms with
      256K pages for the time being.
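
      The failing check can be reproduced as plain arithmetic. The sketch
      below only mirrors the BUILD_BUG_ON() condition quoted above; the
      helper name build_check_passes() and the list of page sizes are
      illustrative assumptions, not kernel code.

```python
KiB = 1024
BTRFS_MAX_COMPRESSED = 128 * KiB  # value stated in the commit message


def build_check_passes(page_size: int) -> bool:
    """Mirror BUILD_BUG_ON((BTRFS_MAX_COMPRESSED % PAGE_SIZE) != 0)."""
    return BTRFS_MAX_COMPRESSED % page_size == 0


# Page sizes chosen for illustration only.
for page_size in (4 * KiB, 16 * KiB, 64 * KiB, 256 * KiB):
    status = "ok" if build_check_passes(page_size) else "BUILD_BUG_ON fires"
    print(f"PAGE_SIZE={page_size // KiB}K: {status}")
```

      Only the 256K case leaves a non-zero remainder (128K), which is why
      the build breaks on that configuration alone.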
      
      There are two platforms that can select 256K pages:
       - hexagon
       - powerpc
      
      Disable BTRFS when 256K page size is selected. Supporting this would
      require changes to the subpage mode that's currently being developed.
      Given that 256K is many times larger than the page sizes commonly used,
      and for which the algorithms and structures have been tuned, it's out
      of scope and disabling the build is a reasonable option.
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: fix invalid path for unlink operations after parent orphanization · d8ac76cd
      Filipe Manana authored
      During an incremental send operation, when processing the new references
      for the current inode, we might send an unlink operation for another inode
      that has a conflicting path and has more than one hard link. However this
      path was computed and cached before we processed previous new references
      for the current inode. We may have orphanized a directory of that path
      while processing a previous new reference, in which case the path will
      be invalid and cause the receiver process to fail.
      
      The following reproducer triggers the problem and explains how/why it
      happens in its comments:
      
        $ cat test-send-unlink.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        # Create our test files and directory. Inode 259 (file3) has two hard
        # links.
        touch $MNT/file1
        touch $MNT/file2
        touch $MNT/file3
      
        mkdir $MNT/A
        ln $MNT/file3 $MNT/A/hard_link
      
        # Filesystem looks like:
        #
        # .                                     (ino 256)
        # |----- file1                          (ino 257)
        # |----- file2                          (ino 258)
        # |----- file3                          (ino 259)
        # |----- A/                             (ino 260)
        #        |---- hard_link                (ino 259)
        #
      
        # Now create the base snapshot, which is going to be the parent snapshot
        # for a later incremental send.
        btrfs subvolume snapshot -r $MNT $MNT/snap1
        btrfs send -f /tmp/snap1.send $MNT/snap1
      
        # Move inode 257 into directory inode 260. This results in computing the
        # path for inode 260 as "/A" and caching it.
        mv $MNT/file1 $MNT/A/file1
      
        # Move inode 258 (file2) into directory inode 260, with a name of
        # "hard_link", moving first inode 259 away since it currently has that
        # location and name.
        mv $MNT/A/hard_link $MNT/tmp
        mv $MNT/file2 $MNT/A/hard_link
      
        # Now rename inode 260 to something else (B for example) and then create
        # a hard link for inode 258 that has the old name and location of inode
        # 260 ("/A").
        mv $MNT/A $MNT/B
        ln $MNT/B/hard_link $MNT/A
      
        # Filesystem now looks like:
        #
        # .                                     (ino 256)
        # |----- tmp                            (ino 259)
        # |----- file3                          (ino 259)
        # |----- B/                             (ino 260)
        # |      |---- file1                    (ino 257)
        # |      |---- hard_link                (ino 258)
        # |
        # |----- A                              (ino 258)
      
        # Create another snapshot of our subvolume and use it for an incremental
        # send.
        btrfs subvolume snapshot -r $MNT $MNT/snap2
        btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
      
        # Now unmount the filesystem, create a new one, mount it and try to
        # apply both send streams to recreate both snapshots.
        umount $DEV
      
        mkfs.btrfs -f $DEV >/dev/null
      
        mount $DEV $MNT
      
        # First add the first snapshot to the new filesystem by applying the
        # first send stream.
        btrfs receive -f /tmp/snap1.send $MNT
      
        # The incremental receive operation below used to fail with the
        # following error:
        #
        #    ERROR: unlink A/hard_link failed: No such file or directory
        #
        # This is because when send is processing inode 257, it generates the
        # path for inode 260 as "/A", since that inode is its parent in the send
        # snapshot, and caches that path.
        #
        # Later when processing inode 258, it first processes its new reference
        # that has the path of "/A", which results in orphanizing inode 260
        # because there is a path collision. This results in issuing a rename
        # operation from "/A" to "/o260-6-0".
        #
        # Finally when processing the new reference "B/hard_link" for inode 258,
        # it notices that it collides with inode 259 (not yet processed, because
        # it has a higher inode number), since that inode has the name
        # "hard_link" under the directory inode 260. It also checks that inode
        # 259 has two hardlinks, so it decides to issue an unlink operation for
        # the name "hard_link" for inode 259. However the path passed to the
        # unlink operation is "/A/hard_link", which is incorrect since currently
        # "/A" does not exist, due to the orphanization of inode 260 mentioned
        # before. The path is incorrect because it was computed and cached
        # before the orphanization. This causes the receiver to fail with
        # the above error.
        btrfs receive -f /tmp/snap2.send $MNT
      
        umount $MNT
      
      When running the test, it fails like this:
      
        $ ./test-send-unlink.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: unlink A/hard_link failed: No such file or directory
      
      Fix this by recomputing a path before issuing an unlink operation when
      processing the new references for the current inode if we have
      previously orphanized a directory.
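
      The stale-path problem and the fix can be illustrated with a toy
      model. Everything in the sketch below (the SendState class, its
      fields and methods) is hypothetical and only mimics the idea of
      "recompute the path if a directory was orphanized"; it is not the
      kernel's send implementation.

```python
ROOT_INO = 256


class SendState:
    """Toy model of cached paths during an incremental send (hypothetical)."""

    def __init__(self) -> None:
        self.entries = {}          # ino -> (parent ino, name)
        self.path_cache = {}       # ino -> previously computed path
        self.orphanized_dir = False

    def path_of(self, ino: int) -> str:
        if ino == ROOT_INO:
            return ""
        if ino not in self.path_cache:
            parent, name = self.entries[ino]
            self.path_cache[ino] = f"{self.path_of(parent)}/{name}"
        return self.path_cache[ino]

    def orphanize(self, ino: int, orphan_name: str) -> None:
        # The directory is renamed, so cached paths through it are stale.
        self.entries[ino] = (ROOT_INO, orphan_name)
        self.orphanized_dir = True

    def unlink_path(self, parent: int, name: str) -> str:
        # The fix: recompute instead of trusting the cache after orphanization.
        if self.orphanized_dir:
            self.path_cache.clear()
        return f"{self.path_of(parent)}/{name}"


state = SendState()
state.entries[260] = (ROOT_INO, "A")
stale = state.path_of(260) + "/hard_link"      # computed and cached early
state.orphanize(260, "o260-6-0")
print(stale)                                   # path the old code would send
print(state.unlink_path(260, "hard_link"))     # path after recomputation
```

      The first printed path refers to "/A", which no longer exists on the
      receiving side; the second follows the orphan name from the
      reproducer's comments.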
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 21 Jun, 2021 38 commits
    • btrfs: inline wait_current_trans_commit_start in its caller · ae5d29d4
      David Sterba authored
      Function wait_current_trans_commit_start is now fairly trivial so it can
      be inlined in its only caller.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sink wait_for_unblock parameter to async commit · 32cc4f87
      David Sterba authored
      There's only one caller left, btrfs_ioctl_start_sync, that passes 0, so
      we can remove the switch in btrfs_commit_transaction_async.
      
      A cleanup 9babda9f ("btrfs: Remove async_transid from
      btrfs_mksubvol/create_subvol/create_snapshot") removed calls that passed
      1, so this is a followup.
      
      As this removes the last call of wait_current_trans_commit_start_and_unblock,
      remove the function as well.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove total_data_size variable in btrfs_batch_insert_items() · bfaa324e
      Nathan Chancellor authored
      clang warns:
      
        fs/btrfs/delayed-inode.c:684:6: warning: variable 'total_data_size' set
        but not used [-Wunused-but-set-variable]
      	  int total_data_size = 0, total_size = 0;
      	      ^
        1 warning generated.
      
      This variable's value has been unused since commit fc0d82e1 ("btrfs:
      sink total_data parameter in setup_items_for_insert"). Eliminate it.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/1391
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: eliminate insert label in add_falloc_range · 77d25534
      Nikolay Borisov authored
      By way of inverting the list_empty conditional the insert label can be
      eliminated, making the function's flow entirely linear.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: fix a rare race between metadata endio and eb freeing · 3d078efa
      Qu Wenruo authored
      [BUG]
      There is a very rare ASSERT() triggering during full fstests run for
      subpage rw support.
      
      No other reproducer so far.
      
      The ASSERT() gets triggered for metadata read in
      btrfs_page_set_uptodate() inside end_page_read().
      
      [CAUSE]
      There is still a small race window for metadata only, the race could
      happen like this:
      
                      T1                  |              T2
      ------------------------------------+-----------------------------
      end_bio_extent_readpage()           |
      |- btrfs_validate_metadata_buffer() |
      |  |- free_extent_buffer()          |
      |     Still have 2 refs             |
      |- end_page_read()                  |
         |- if (unlikely(PagePrivate())   |
         |  The page still has Private    |
         |                                | free_extent_buffer()
         |                                | |  Only 1 ref left, will be
         |                                | |  released
         |                                | |- detach_extent_buffer_page()
         |                                |    |- btrfs_detach_subpage()
         |- btrfs_page_set_uptodate()     |
            The page no longer has Private|
            >>> ASSERT() triggered <<<    |
      
      This race window is super small, thus pretty hard to hit, even with so
      many runs of fstests.
      
      But the race window is still there, we have to go another way to solve
      it other than relying on random PagePrivate() check.
      
      Data path is not affected, as it will lock the page before reading,
      while unlocking the page after the last read has finished, thus no race
      window.
      
      [FIX]
      This patch will fix the bug by repurposing btrfs_subpage::readers.
      
      Now btrfs_subpage::readers will be a member shared by both metadata and
      data.
      
      For metadata path, we don't do the page unlock as metadata only relies
      on extent locking.
      
      At the same time, teach page_range_has_eb() to take
      btrfs_subpage::readers into consideration.
      
      So that even if the last eb of a page gets freed, page::private won't be
      detached as long as there still are pending end_page_read() calls.
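
      The effect of counting readers can be sketched with a small state
      model. The class and field names below are hypothetical stand-ins
      (the kernel uses btrfs_subpage::readers and page::private); the
      sketch only shows why a pending read keeps the private data attached.

```python
class SubpageMetadataPage:
    """Toy model of one page holding subpage metadata (names are hypothetical)."""

    def __init__(self, eb_count: int) -> None:
        self.eb_count = eb_count      # extent buffers still attached to the page
        self.readers = 0              # pending end_page_read() calls
        self.private_attached = True  # stands in for page::private

    def page_range_has_eb(self) -> bool:
        # With the fix, pending readers also count as "still in use".
        return self.eb_count > 0 or self.readers > 0

    def start_read(self) -> None:
        self.readers += 1

    def free_extent_buffer(self) -> None:
        self.eb_count -= 1
        if not self.page_range_has_eb():
            self.private_attached = False

    def end_page_read(self) -> bool:
        # Returns whether the ASSERT() on page::private would hold.
        ok = self.private_attached
        self.readers -= 1
        return ok


page = SubpageMetadataPage(eb_count=1)
page.start_read()             # T1: read bio in flight
page.free_extent_buffer()     # T2: last extent buffer of the page is freed
print(page.private_attached)  # still attached, because readers == 1
print(page.end_page_read())   # T1: the check in end_page_read() holds
```

      Without the readers term in page_range_has_eb(), the second step
      would detach the private data and the final check would fail, which
      is the race shown in the diagram.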
      
      By this we eliminate the race window. This will slightly increase the
      metadata memory usage, as the page may not be released as frequently as
      usual, but it should not be a big deal.
      
      The code got introduced in ("btrfs: submit read time repair only for
      each corrupted sector"), but the fix is in a separate patch to keep the
      problem description and the crash is rare so it should not hurt
      bisectability.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't clear page extent mapped if we're not invalidating the full page · bcd77455
      Qu Wenruo authored
      [BUG]
      With current btrfs subpage rw support, the following script can lead to
      fs hang:
      
        $ mkfs.btrfs -f -s 4k $dev
        $ mount $dev -o nospace_cache $mnt
        $ fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt
      
      The fs will hang at btrfs_start_ordered_extent().
      
      [CAUSE]
      In the above test case, btrfs_invalidatepage() will be called with the
      following parameters:

        offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000

      Since @offset is 0, btrfs_invalidatepage() will try to invalidate the
      full page, and finally call clear_page_extent_mapped() which will detach
      the subpage structure from the page.
      
      And since the page no longer has subpage structure, the subpage dirty
      bitmap will be cleared, preventing the dirty range from being written
      back, thus no way to wake up the ordered extent.
      
      [FIX]
      Follow other filesystems and only invalidate the page if the range
      covers the full page.
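
      The parameters from the [CAUSE] section can be checked by hand. The
      sketch below assumes a 64K page, 4K sectors (from "-s 4k") and that
      bit N of the dirty bitmap corresponds to sector N of the page; the
      helper names are illustrative, not kernel functions.

```python
PAGE_SIZE = 64 * 1024    # assumption: 64K page size
SECTOR_SIZE = 4 * 1024   # from "mkfs.btrfs -f -s 4k"


def covers_full_page(offset: int, length: int) -> bool:
    """True only when the invalidated range spans the whole page."""
    return offset == 0 and length == PAGE_SIZE


def dirty_sector_ranges(bitmap: int) -> list:
    """Byte ranges inside the page whose dirty bit is set (bit N = sector N)."""
    ranges = []
    for bit in range(PAGE_SIZE // SECTOR_SIZE):
        if bitmap & (1 << bit):
            start = bit * SECTOR_SIZE
            ranges.append((start, start + SECTOR_SIZE))
    return ranges


offset, length, bitmap = 0, 53248, 0x2000   # values from the commit message
print(covers_full_page(offset, length))
print(dirty_sector_ranges(bitmap))
```

      Under these assumptions the only dirty sector starts at byte 53248,
      exactly where the invalidated range ends, so treating the call as a
      full-page invalidation throws away a dirty range that still has to
      be written back.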
      
      There are cases like truncate_setsize() which can call
      btrfs_invalidatepage() with offset == 0 and length != 0 for the last
      page of an inode.
      
      Although the old code will still try to invalidate the full page, we are
      still safe to just wait for ordered extent to finish.
      So it shouldn't cause extra problems.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() · 0528476b
      Qu Wenruo authored
      [BUG]
      With current subpage RW support, the following script can hang the fs
      with 64K page size.
      
       # mkfs.btrfs -f -s 4k $dev
       # mount $dev -o nospace_cache $mnt
       # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt
      
      The kernel will do an infinite loop in btrfs_punch_hole_lock_range().
      
      [CAUSE]
      In btrfs_punch_hole_lock_range() we:
      
      - Truncate page cache range
      - Lock extent io tree
      - Wait for any ordered extents in the range.

      We exit the loop only once all of the following conditions are met:

      - No ordered extent in the lock range
      - No page is in the lock range

      The latter condition has a pitfall: it only works for the sector size ==
      PAGE_SIZE case.

      It can't handle the following subpage case:
      
        0       32K     64K     96K     128K
        |       |///////||//////|       ||
      
      lockstart=32K
      lockend=96K - 1
      
      In this case, although the range crosses 2 pages,
      truncate_pagecache_range() will invalidate no page at all, but only zero
      the [32K, 96K) range of the two pages.
      
      Thus filemap_range_has_page(32K, 96K-1) will always return true, thus we
      will never meet the loop exit condition.
      
      [FIX]
      Fix the problem by doing page alignment for the lock range.
      
      Function filemap_range_has_page() has already handled lend < lstart
      case, we only need to round up @lockstart, and round_down @lockend for
      truncate_pagecache_range().
      
      This modification should not change anything for the sector size ==
      PAGE_SIZE case, as in that case our range is already page aligned.
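
      The alignment can be worked through with the numbers from the
      diagram. This is a sketch assuming a 64K page and an inclusive
      @lockend; the function page_aligned_range() and the exact
      "lockend + 1" rounding are assumptions made for illustration.

```python
KiB = 1024
PAGE_SIZE = 64 * KiB  # assumption: 64K page size, as in the [BUG] section


def round_up(value: int, align: int) -> int:
    return (value + align - 1) // align * align


def round_down(value: int, align: int) -> int:
    return value // align * align


def page_aligned_range(lockstart: int, lockend: int) -> tuple:
    """Shrink an inclusive byte range to the full pages it contains."""
    page_lockstart = round_up(lockstart, PAGE_SIZE)
    page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1
    return page_lockstart, page_lockend


start, end = page_aligned_range(32 * KiB, 96 * KiB - 1)
print(start, end, "empty" if end < start else "non-empty")
```

      For the subpage case the aligned range becomes [64K, 64K - 1], that
      is lend < lstart, so no page can be reported and the loop can exit.
      A range that is already page aligned is returned unchanged.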
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reflink: make copy_inline_to_page() to be subpage compatible · 3115deb3
      Qu Wenruo authored
      The modifications are:
      
      - Page copy destination
        For subpage case, one page can contain multiple sectors, thus we can
        no longer expect the memcpy_to_page()/btrfs_decompress() to copy
        data into page offset 0.
        The correct offset is offset_in_page(file_offset) now, which should
        handle both regular sectorsize and subpage cases well.
      
      - Page status update
        Now we need to use subpage helper to handle the page status update.
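
      The destination offset change can be sketched as follows. The
      bytearray stands in for a page and copy_to_page() is a hypothetical
      helper; only the offset_in_page() arithmetic reflects the text
      above, and a 64K page size is assumed.

```python
KiB = 1024
PAGE_SIZE = 64 * KiB  # assumption: 64K page size


def offset_in_page(file_offset: int) -> int:
    """Offset of a file position inside its page (PAGE_SIZE is a power of 2)."""
    return file_offset & (PAGE_SIZE - 1)


def copy_to_page(page: bytearray, file_offset: int, data: bytes) -> int:
    """Copy data to the page at the file offset's in-page position."""
    off = offset_in_page(file_offset)
    page[off:off + len(data)] = data
    return off


page = bytearray(PAGE_SIZE)
# A destination 8K into the second page of the file.
off = copy_to_page(page, 64 * KiB + 8 * KiB, b"inline")
print(off, bytes(page[off:off + 6]))
```

      With sector size == PAGE_SIZE the file offset is page aligned, so
      the in-page offset is always 0 and the old behaviour is preserved.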
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_page_mkwrite() to be subpage compatible · 2d8ec40e
      Qu Wenruo authored
      Only set_page_dirty() and SetPageUptodate() are not subpage compatible.
      Convert them to subpage helpers, so that __extent_writepage_io() can
      submit page content correctly.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_truncate_block() to be subpage compatible · 6c9ac8be
      Qu Wenruo authored
      btrfs_truncate_block() itself is already mostly subpage compatible, the
      only missing part is the page dirtying code.
      
      Currently if we have a sector that needs to be truncated, we set the
      sector aligned range delalloc, then set the full page dirty.
      
      The problem is, current subpage code requires the subpage dirty bit to
      be set, or __extent_writepage_io() won't submit a bio, so the ordered
      extent never finishes.

      So this patch makes btrfs_truncate_block() call the
      btrfs_page_set_dirty() helper instead of set_page_dirty() to fix the
      problem.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make __extent_writepage_io() only submit dirty range for subpage · c5ef5c6c
      Qu Wenruo authored
      __extent_writepage_io() function originally just iterates through all
      the extent maps of a page, and submits any regular extents.
      
      This is fine for sectorsize == PAGE_SIZE case, as if a page is dirty, we
      need to submit the only sector contained in the page.
      
      But for subpage case, one dirty page can contain several clean sectors
      with at least one dirty sector.
      
      If __extent_writepage_io() still submits all regular extent maps, it can
      submit data which is already written to disk.
      And since such already written data won't have corresponding ordered
      extents, it will trigger a BUG_ON() in btrfs_csum_one_bio().
      
      Change the behavior of __extent_writepage_io() by finding the first
      dirty byte in the page, and only submit the dirty range rather than the
      full extent.
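
      Finding the dirty ranges of a page can be sketched as a scan over
      the per-sector dirty bitmap. The function below is a hypothetical
      illustration assuming 4K sectors in a 64K page and bit N = sector N;
      it is not the helper used by the patch.

```python
PAGE_SIZE = 64 * 1024    # assumption: 64K page size
SECTOR_SIZE = 4 * 1024   # assumption: 4K sector size
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE


def find_next_dirty_range(bitmap: int, start_bit: int):
    """Return (first_bit, end_bit) of the next run of set bits, or None."""
    bit = start_bit
    while bit < SECTORS_PER_PAGE and not (bitmap >> bit) & 1:
        bit += 1
    if bit == SECTORS_PER_PAGE:
        return None
    end = bit
    while end < SECTORS_PER_PAGE and (bitmap >> end) & 1:
        end += 1
    return bit, end


def dirty_byte_ranges(bitmap: int) -> list:
    """All dirty byte ranges of the page, in order."""
    ranges, cur = [], 0
    while (found := find_next_dirty_range(bitmap, cur)) is not None:
        first, end = found
        ranges.append((first * SECTOR_SIZE, end * SECTOR_SIZE))
        cur = end
    return ranges


# Sectors 3, 9 and 10 dirty; everything else is clean and is not submitted.
print(dirty_byte_ranges(0b0000_0110_0000_1000))
```

      Only the two dirty runs would be submitted, so clean sectors that
      have no ordered extent are never sent to btrfs_csum_one_bio().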
      
      While we're here, also modify the following calls to be subpage
      compatible:
      
      - SetPageError()
      - end_page_writeback()
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_set_range_writeback() subpage compatible · d2a91064
      Qu Wenruo authored
      Function btrfs_set_range_writeback() currently just sets the page
      writeback unconditionally.
      
      Change it to call the subpage helper so that we can handle both cases
      well.
      
      Since the subpage helper needs btrfs_fs_info, also change the parameter
      to accept btrfs_inode.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() · 4750af3b
      Qu Wenruo authored
      In cow_file_range(), after we have succeeded creating an inline extent,
      we unlock the page with extent_clear_unlock_delalloc() by passing
      locked_page == NULL.
      
      For sectorsize == PAGE_SIZE case, this is just making the page lock and
      unlock harder to grab.
      
      But for incoming subpage case, it can be a big problem.
      
      For incoming subpage case, page locking has two entry points:
      
      - __process_pages_contig()
        In that case, we know exactly the range we want to lock (which only
        requires sector alignment).
        To handle the subpage requirement, we introduce btrfs_subpage::writers
        to page::private, and will update it in __process_pages_contig().
      
      - Other direct lock_page()/unlock_page() call sites
        Those won't touch btrfs_subpage::writers at all.
      
      This means, page locked by __process_pages_contig() can only be unlocked
      by __process_pages_contig().
      Thankfully we already have the existing infrastructure in the form of
      @locked_page in various call sites.
      
      Unfortunately, extent_clear_unlock_delalloc() in cow_file_range() after
      creating an inline extent is the exception.
      It intentionally calls extent_clear_unlock_delalloc() with locked_page ==
      NULL, to also unlock current page (and clear its dirty/writeback bits).
      
      To co-operate with incoming subpage modifications, and make the page
      lock/unlock pair easier to understand, this patch will still call
      extent_clear_unlock_delalloc() with locked_page, and only unlock the
      page in __extent_writepage().
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update locked page dirty/writeback/error bits in __process_pages_contig · a33a8e9a
      Qu Wenruo authored
      When __process_pages_contig() gets called for
      extent_clear_unlock_delalloc(), if we hit the locked page, only Private2
      bit is updated, but dirty/writeback/error bits are all skipped.
      
      There are several call sites that call extent_clear_unlock_delalloc()
      with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK:
      
      - cow_file_range()
      - run_delalloc_nocow()
      - cow_file_range_async()
        All for their error handling branches.
      
      For those call sites, since we skip the locked page for
      dirty/error/writeback bit update, the locked page will still have its
      subpage dirty bit remaining.
      
      Normally it's the call sites which locked the page to handle the locked
      page, but it won't hurt if we also do the update.
      
      Especially there are already other call sites doing the same thing by
      manually passing NULL as locked_page.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make page Ordered bit to be subpage compatible · b945a463
      Qu Wenruo authored
      This involves the following modifications:
      
      - Ordered extent creation
        This is done in process_one_page(), now PAGE_SET_ORDERED will call
        subpage helper to do the work.
      
      - endio functions
        This is done in btrfs_mark_ordered_io_finished().
      
      - btrfs_invalidatepage()
      
      - btrfs_cleanup_ordered_extents()
        Use the subpage page helper, and add an extra branch to exit if the
        locked page has covered the full range.
      
      Now the usage of page Ordered flag for ordered extent accounting is fully
      subpage compatible.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce helpers for subpage ordered status · 6f17400b
      Qu Wenruo authored
      This patch introduces the following functions to handle btrfs subpage
      ordered (Private2) status:
      
      - btrfs_subpage_set_ordered()
      - btrfs_subpage_clear_ordered()
      - btrfs_subpage_test_ordered()
        These helpers can only be called when the range is ensured to be
        inside the page.
      
      - btrfs_page_set_ordered()
      - btrfs_page_clear_ordered()
      - btrfs_page_test_ordered()
        These helpers can handle both regular sector size and subpage without
        problem.
      
      These functions are here to coordinate btrfs_invalidatepage() with
      btrfs_writepage_endio_finish_ordered(), to make sure only one of those
      functions can finish the ordered extent.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make process_one_page() to handle subpage locking · 1e1de387
      Qu Wenruo authored
      Introduce a new data inode specific subpage member, writers, to record
      how many sectors are under page lock for delalloc writing.
      
      This member acts pretty much the same as readers, except it's only for
      delalloc writes.
      
      This is important for delalloc code to track which page can really be
      freed, as we have cases like run_delalloc_nocow() where we may be
      processing a nocow range inside a page, but need to exit half way to
      do cow.
      In that case, we need a way to determine if we can really unlock a full
      page.
      
      With the new btrfs_subpage::writers, there is a new requirement:
      - Page locked by process_one_page() must be unlocked by
        process_one_page()
        There are still tons of call sites that manually lock and unlock a
        page, without updating btrfs_subpage::writers.
        So if we lock a page through process_one_page() then it must be
        unlocked by process_one_page() to keep btrfs_subpage::writers
        consistent.
      
        This will be handled in next patch.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make end_bio_extent_writepage() to be subpage compatible · 9047e317
      Qu Wenruo authored
      Now in end_bio_extent_writepage(), the only subpage incompatible code is
      the end_page_writeback().
      
      Just call the subpage helpers.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status · e38992be
      Qu Wenruo authored
      For __process_pages_contig() and process_one_page(), to handle subpage
      we only need to pass bytenr in and call subpage helpers to handle
      dirty/error/writeback status.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_dirty_pages() to be subpage compatible · f02a85d2
      Qu Wenruo authored
      Since the extent io tree operations in btrfs_dirty_pages() are already
      subpage compatible, we only need to make the page status update to use
      subpage helpers.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: only require sector size alignment for end_bio_extent_writepage() · 321a02db
      Qu Wenruo authored
      Just like read page, for subpage support we only require sector size
      alignment.
      
      So change the error message condition to only require sector alignment.
      
      This should not affect existing code, as for regular sectorsize ==
      PAGE_SIZE case, we are still requiring page alignment.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: provide btrfs_page_clamp_*() helpers · 60e2d255
      Qu Wenruo authored
      In the coming subpage RW support, a lot of page status update calls need
      to be converted to their subpage compatible versions, which need @start
      and @len.
      
      Some call sites already have such @start/@len and are already in
      page range, like various endio functions.
      
      But there are also call sites which need to clamp the range for subpage
      case, like btrfs_dirty_pages() and __process_pages_contig().
      
      Here we introduce new helpers, btrfs_page_clamp_*(), which clamp the
      range, and do so only for the subpage version.
      
      In theory all existing btrfs_page_*() calls could be converted to use
      btrfs_page_clamp_*() directly, but that would make us do unnecessary
      clamp operations.
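
The clamp step itself can be sketched as follows. This is only an illustration under stated assumptions: the function name, the 64K page size and the plain integer interface are all hypothetical, and the real helpers operate on pages and subpage bitmaps.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE 65536ULL /* assumed 64K page, for illustration only */

/*
 * Hypothetical stand-in for the clamp step: shrink [*start, *start + *len)
 * so that it lies inside the page that begins at page_start.
 * The caller guarantees that the range overlaps the page.
 */
static void toy_clamp_range_to_page(uint64_t page_start, uint64_t *start,
				    uint32_t *len)
{
	uint64_t page_end = page_start + TOY_PAGE_SIZE;	/* exclusive */
	uint64_t range_end = *start + *len;			/* exclusive */
	uint64_t new_start = *start > page_start ? *start : page_start;
	uint64_t new_end = range_end < page_end ? range_end : page_end;

	*start = new_start;
	*len = (uint32_t)(new_end - new_start);
}
```

For a 16K range starting at 60K, the first page keeps [60K, 64K) and the second page keeps [64K, 76K).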
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      60e2d255
    • Qu Wenruo's avatar
      btrfs: refactor page status update into process_one_page() · ed8f13bf
      Qu Wenruo authored
      In __process_pages_contig() we update page status according to page_ops.
      
      That update process is a bunch of 'if' branches inside two loops, which
      makes it pretty hard to expand for later subpage operations.
      
      So this patch will extract these operations into their own function,
      process_one_page().
      
      Since we're refactoring __process_pages_contig() anyway, also move the
      new helper and __process_pages_contig() before their first caller, to
      remove the forward declaration.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ed8f13bf
    • Qu Wenruo's avatar
      btrfs: pass bytenr directly to __process_pages_contig() · 98af9ab1
      Qu Wenruo authored
      As a preparation for incoming subpage support, we need bytenr passed to
      __process_pages_contig() directly, not the current page index.
      
      So change the parameter and all callers to pass bytenr in.
      
      With the modification, here we need to replace the old @index_ret with
      @processed_end for __process_pages_contig(), but this brings a small
      problem.
      
      Normally we follow the inclusive return value, meaning @processed_end
      should be the last byte we processed.
      
      If parameter @start is 0, and we failed to lock any page, then we would
      return @processed_end as -1, causing more problems for
      __unlock_for_delalloc().
      
      So here for @processed_end, we use two different return value patterns.
      If we have locked any page, @processed_end will be the last byte of
      locked page.
      Or it will be @start otherwise.
      
      This change will impact lock_delalloc_pages(), so it needs to check
      @processed_end to only unlock the range if we have locked any.
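
The two return value patterns can be modelled with a toy version of the locking loop. Everything here is an assumption made for the sketch: the function names, the 4K page size and the array of "lockable" flags standing in for real page locking. Clamping the result to @end is also a choice of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL /* assumed page size for the sketch */

/*
 * Try to "lock" every page covering [start, end] (inclusive).
 * lockable[i] says whether page index i can be locked.
 * Returns 0 on success, -1 at the first page that cannot be locked.
 * *processed_end is the last byte of the last locked page (clamped to
 * end), or start if no page was locked at all.
 */
static int toy_lock_pages(const bool *lockable, uint64_t start, uint64_t end,
			  uint64_t *processed_end)
{
	uint64_t first = start / TOY_PAGE_SIZE;
	uint64_t last = end / TOY_PAGE_SIZE;
	uint64_t i;

	*processed_end = start;
	for (i = first; i <= last; i++) {
		uint64_t page_last = (i + 1) * TOY_PAGE_SIZE - 1;

		if (!lockable[i])
			return -1;
		*processed_end = page_last < end ? page_last : end;
	}
	return 0;
}

/* The caller only unlocks if at least one page was locked. */
static bool toy_need_unlock(uint64_t start, uint64_t processed_end)
{
	return processed_end > start;
}
```

With @start == 0 and no page locked, @processed_end stays 0 instead of wrapping to -1, and the caller skips the unlock.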
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      98af9ab1
    • Qu Wenruo's avatar
      btrfs: fix hang when run_delalloc_range() failed · 968f2566
      Qu Wenruo authored
      [BUG]
      When running subpage preparation patches on x86, btrfs/125 will hang
      forever with one ordered extent never finished.
      
      [CAUSE]
      The test case btrfs/125 itself will always fail as the fix is never merged.
      
      When the test fails at balance, btrfs needs to clean up the ordered
      extent in btrfs_cleanup_ordered_extents() for data reloc inode.
      
      The problem is the sequence in which we clean up the page Ordered bit.
      
      Currently it works like:
      
        btrfs_cleanup_ordered_extents()
        |- find_get_page();
        |- btrfs_page_clear_ordered(page);
        |  Now the page doesn't have Ordered bit anymore.
        |  !!! This also includes the first (locked) page !!!
        |
        |- offset += PAGE_SIZE
        |  This is to skip the first page
        |- __endio_write_update_ordered()
           |- btrfs_mark_ordered_io_finished(NULL)
              Except the first page, all ordered extents are finished.
      
      Then the locked page is cleaned up in __extent_writepage():
      
        __extent_writepage()
        |- If (PageError(page))
        |- end_extent_writepage()
           |- btrfs_mark_ordered_io_finished(page)
              |- if (btrfs_test_page_ordered(page))
              |-  !!! The page gets skipped !!!
                  The ordered extent is not decreased as the page doesn't
                  have ordered bit anymore.
      
      This leaves the ordered extent with bytes_left == sectorsize, so it never
      finishes.
      
      [FIX]
      The fix is to ensure we never clear page Ordered bit without running the
      ordered extent accounting.
      
      Here we choose to skip the locked page in
      btrfs_cleanup_ordered_extents() so that later end_extent_writepage() can
      properly finish the ordered extent.
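
The accounting imbalance can be reproduced with a toy model. All names and sizes below are hypothetical: one ordered extent covering four single-sector pages, with page 0 playing the role of the locked page.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_PAGES 4
#define TOY_SECTORSIZE 4096u

struct toy_ordered {
	uint32_t bytes_left;
	bool page_ordered[TOY_NR_PAGES];
};

/* Later cleanup of one page: accounts only if the Ordered bit is still set. */
static void toy_finish_page(struct toy_ordered *oe, int page)
{
	if (!oe->page_ordered[page])
		return;	/* skipped, nothing is subtracted */
	oe->page_ordered[page] = false;
	oe->bytes_left -= TOY_SECTORSIZE;
}

/*
 * Cleanup after a failed run. Page 0 is the locked page.
 * skip_locked == false models the old behaviour, true models the fix.
 */
static void toy_cleanup(struct toy_ordered *oe, bool skip_locked)
{
	int i;

	/* Clear the Ordered bit without doing any accounting. */
	for (i = skip_locked ? 1 : 0; i < TOY_NR_PAGES; i++)
		oe->page_ordered[i] = false;

	/* Finish the accounting for every page except the locked one. */
	oe->bytes_left -= (TOY_NR_PAGES - 1) * TOY_SECTORSIZE;
}
```

Running toy_cleanup() followed by toy_finish_page() on page 0 leaves bytes_left at one sector with the old behaviour, and at zero with the fix.
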
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      968f2566
    • Qu Wenruo's avatar
      btrfs: rename PagePrivate2 to PageOrdered inside btrfs · f57ad937
      Qu Wenruo authored
      Inside btrfs we use Private2 page status to indicate we have an ordered
      extent with pending IO for the sector.
      
      But the page status name, Private2, tells us nothing about the bit
      itself, so this patch will rename it to Ordered.
      A comment describing the bit is also added, so a reader who is still
      uncertain about the page Ordered status can find the explanation
      easily.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f57ad937
    • Qu Wenruo's avatar
      btrfs: refactor btrfs_invalidatepage() for subpage support · 3b835840
      Qu Wenruo authored
      This patch will refactor btrfs_invalidatepage() for the incoming subpage
      support.
      
      The involved modifications are:
      
      - Use while() loop instead of "goto again;"
      - Use single variable to determine whether to delete extent states
        Each branch will also have comments why we can or cannot delete the
        extent states
      - Do qgroup free and extent states deletion per-loop
        Current code can only work for PAGE_SIZE == sectorsize case.
      
      This refactor also makes it clear what we do for different sectors:
      
      - Sectors without ordered extent
        We're completely safe to remove all extent states for the sector(s)
      
      - Sectors with ordered extent, but no Private2 bit
        This means the endio has already been executed, we can't remove all
        extent states for the sector(s).
      
      - Sectors with ordered extent that still have Private2 bit
        This means we need to decrease the ordered extent accounting.
        And then it comes to two different variants:
      
        * We have finished and removed the ordered extent
          Then it's the same as "sectors without ordered extent"
        * We didn't finish the ordered extent
          We can remove some extent states, but not all.
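
The per-sector cases listed above can be condensed into a small decision function. This is a sketch only: the enum, the function name and the three boolean inputs are assumptions introduced for illustration, and the real code works on extent state bits rather than a two-value result.

```c
#include <stdbool.h>

enum toy_state_action {
	TOY_DELETE_ALL,		/* safe to remove all extent states */
	TOY_DELETE_PARTIAL,	/* some extent states must be kept */
};

/*
 * has_ordered:     an ordered extent covers the sector(s)
 * has_private2:    the Private2 bit is still set, so endio has not run
 * ordered_removed: after decreasing the accounting, the ordered extent
 *                  was finished and removed
 */
static enum toy_state_action toy_invalidate_action(bool has_ordered,
						   bool has_private2,
						   bool ordered_removed)
{
	if (!has_ordered)
		return TOY_DELETE_ALL;
	if (!has_private2)
		return TOY_DELETE_PARTIAL;	/* endio already executed */
	return ordered_removed ? TOY_DELETE_ALL : TOY_DELETE_PARTIAL;
}
```
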
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3b835840
    • Qu Wenruo's avatar
      btrfs: introduce btrfs_lookup_first_ordered_range() · c095f333
      Qu Wenruo authored
      Although we already have btrfs_lookup_first_ordered_extent() and
      btrfs_lookup_ordered_extent(), they all have their own limitations:
      
      - btrfs_lookup_ordered_extent() can't do extra range check
      
        It's only designed to lookup any ordered extent before certain bytenr.
      
      - btrfs_lookup_first_ordered_extent() may not return the first ordered
        extent in the range
      
        It doesn't ensure the first ordered extent is returned.
        The existing callers are only interested in exhausting all ordered
        extents in a range, the order is not important.
      
      For incoming btrfs_invalidatepage() refactoring, we need a way to
      properly iterate all ordered extents in their bytenr order of a range.
      
      So this patch will introduce a new function,
      btrfs_lookup_first_ordered_range(), to look up ordered extents with
      bytenr order awareness and an extra range check.
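
The intended lookup semantics can be illustrated on a sorted array. This is a sketch under stated assumptions: the structure and function names are hypothetical, and the extents are kept in a plain array sorted by start offset rather than in the kernel's tree.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_extent {
	uint64_t start;
	uint64_t len;
};

/*
 * ext[] is sorted by start and its entries do not overlap.
 * Returns the first entry (in bytenr order) that intersects
 * [start, start + len), or NULL if there is none.
 */
static const struct toy_extent *
toy_lookup_first_range(const struct toy_extent *ext, size_t nr,
		       uint64_t start, uint64_t len)
{
	uint64_t end = start + len;	/* exclusive */
	size_t i;

	for (i = 0; i < nr; i++) {
		if (ext[i].start >= end)
			break;		/* everything after is out of range */
		if (ext[i].start + ext[i].len > start)
			return &ext[i];
	}
	return NULL;
}
```

A caller can then iterate a range in order by restarting the lookup at the end of each returned entry.
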
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c095f333
    • Qu Wenruo's avatar
      btrfs: update comments in btrfs_invalidatepage() · 266a2586
      Qu Wenruo authored
      The existing comments in btrfs_invalidatepage() don't really get to the
      point, especially for what Private2 is really representing and how the
      race avoidance is done.
      
      The truth is, there are only three entry points that do ordered extent
      accounting:
      
      - btrfs_writepage_endio_finish_ordered()
      - __endio_write_update_ordered()
        Those two entry points are just endio functions for dio and buffered
        write.
      
      - btrfs_invalidatepage()
      
      But there is a pitfall, in endio functions there is no check on whether
      the ordered extent is already accounted.
      They just blindly clear the Private2 bit and do the accounting.
      
      So it's all btrfs_invalidatepage()'s responsibility to make sure we
      won't do double account for the same sector.
      
      That's why in btrfs_invalidatepage() we have to wait for page writeback,
      this will ensure all submitted bios have finished, thus their endio
      functions have finished the accounting on the ordered extent.
      
      Then we also check page Private2 to ensure that we only run ordered
      extent accounting on pages that have no bio submitted.
      
      This patch will rework the related comments to make the race clearer,
      along with how we use wait_on_page_writeback() and Private2 to prevent
      double accounting on ordered extent.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      266a2586
    • Qu Wenruo's avatar
      btrfs: refactor how we finish ordered extent io for endio functions · e65f152e
      Qu Wenruo authored
      Btrfs has two endio functions to mark certain io range finished for
      ordered extents:
      
      - __endio_write_update_ordered()
        This is for direct IO
      
      - btrfs_writepage_endio_finish_ordered()
        This is for buffered IO.
      
      However they take different routes to handle ordered extent io:
      
      - Whether to iterate through all ordered extents
        __endio_write_update_ordered() will but
        btrfs_writepage_endio_finish_ordered() will not.
      
        In fact, iterating through all ordered extents will benefit later
        subpage support, while for current PAGE_SIZE == sectorsize requirement
        this behavior makes no difference.
      
      - Whether to update page Private2 flag
        __endio_write_update_ordered() will not update page Private2 flag as
        for iomap direct IO, the page may not even be mapped.
        While btrfs_writepage_endio_finish_ordered() will clear Private2 to
        prevent double accounting against btrfs_invalidatepage().
      
      Those differences are pretty subtle, and the ordered extent iteration
      code in the callers makes them much harder to read.
      
      So this patch will introduce a new function,
      btrfs_mark_ordered_io_finished(), to do the heavy lifting:
      
      - Iterate through all ordered extents in the range
      - Do the ordered extent accounting
      - Queue the work for finished ordered extent
      
      This function has two new features:
      
      - Proper underflow detection and recovery
        The old underflow detection will only detect the problem, then
        continue.
        It gives no proper info like root/inode/ordered extent info, and is
        not noisy enough to be caught by fstests.
      
        Furthermore when underflow happens, the ordered extent will never
        finish.
      
        New error detection will reset the bytes_left to 0, do proper
        kernel warning, and output extra info including root, ino, ordered
        extent range, the underflow value.
      
      - Prevent double accounting based on Private2 flag
        Now if we find a range without Private2 flag, we will skip to next
        range.
        As that means someone else has already finished the accounting of
        ordered extent.
      
        This makes no difference for current code, but will be a critical part
        for incoming subpage support, as we can call
        btrfs_mark_ordered_io_finished() for multiple sectors if they are
        beyond inode size.
        Thus such double accounting prevention is a key feature for subpage.
      
      Now both endio functions only need to call that new function.
      
      And since the only caller of btrfs_dec_test_first_ordered_pending() is
      removed, also remove btrfs_dec_test_first_ordered_pending() completely.
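
The two features can be sketched on a toy ordered extent. This is not the kernel implementation: the structure, the function name, the single boolean standing in for the Private2 flag and the message printed to stderr are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_ordered_extent {
	uint64_t bytes_left;
};

/*
 * Account len bytes of finished IO against oe.
 * *private2 models the Private2 flag of the range being finished.
 * Returns true when this call brought bytes_left down to zero.
 */
static bool toy_mark_io_finished(struct toy_ordered_extent *oe,
				 bool *private2, uint64_t len)
{
	/* No flag: someone else already did the accounting, skip. */
	if (!*private2)
		return false;
	*private2 = false;

	if (len > oe->bytes_left) {
		/* Underflow: report it loudly and recover by resetting. */
		fprintf(stderr, "ordered extent underflow: left=%llu len=%llu\n",
			(unsigned long long)oe->bytes_left,
			(unsigned long long)len);
		oe->bytes_left = 0;
	} else {
		oe->bytes_left -= len;
	}
	return oe->bytes_left == 0;
}
```

Calling the function twice for the same range only accounts once, and an oversized length ends with bytes_left == 0 instead of a wrapped counter.
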
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e65f152e
    • Qu Wenruo's avatar
      btrfs: make Private2 lifespan more consistent · 87b4d86b
      Qu Wenruo authored
      Currently we use page Private2 bit to indicate that we have ordered
      extent for the page range.
      
      But the lifespan of it is not consistent, during regular writeback path,
      there are two locations to clear the same PagePrivate2:
      
          T ----- Page marked Dirty
          |
          + ----- Page marked Private2, through btrfs_run_delalloc_range()
          |
          + ----- Page cleared Private2, through btrfs_writepage_cow_fixup()
          |       in __extent_writepage_io()
          |       ^^^ Private2 cleared for the first time
          |
          + ----- Page marked Writeback, through btrfs_set_range_writeback()
          |       in __extent_writepage_io().
          |
          + ----- Page cleared Private2, through
          |       btrfs_writepage_endio_finish_ordered()
          |       ^^^ Private2 cleared for the second time.
          |
          + ----- Page cleared Writeback, through
                  btrfs_writepage_endio_finish_ordered()
      
      Currently PagePrivate2 is mostly to prevent ordered extent accounting
      being executed for both endio and invalidatepage.
      Thus only the one who cleared page Private2 is responsible for ordered
      extent accounting.
      
      But the fact is, in btrfs_writepage_endio_finish_ordered(), page
      Private2 is cleared and ordered extent accounting is executed
      unconditionally.
      
      The race prevention only happens through btrfs_invalidatepage(), where
      we wait for the page writeback first, before checking the Private2 bit.
      
      This means, Private2 is also protected by Writeback bit, and there is no
      need for btrfs_writepage_cow_fixup() to clear Private2.
      
      This patch will change btrfs_writepage_cow_fixup() to just check
      PagePrivate2, not to clear it.
      The clearing will happen in either btrfs_invalidatepage() or
      btrfs_writepage_endio_finish_ordered().
      
      This makes the Private2 bit easier to understand, just meaning the page
      has unfinished ordered extent attached to it.
      
      And this patch is a hard requirement for the incoming refactoring of
      how we finish ordered IO in endio context, as the coming patch will
      check Private2 to determine if we need to do the ordered extent
      accounting.  Thus this patch is definitely needed or we will hang due to
      unfinished ordered extent.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      87b4d86b
    • Qu Wenruo's avatar
      btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered() · 38a39ac7
      Qu Wenruo authored
      There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
      end_compressed_bio_write().
      
      It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
      which is only supposed to accept inode pages.
      
      Thankfully the important info here is the inode, so let's pass
      btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
      make @page parameter optional.
      
      By this, end_compressed_bio_write() can happily pass page=NULL while
      still getting everything done properly.
      
      Also, to cooperate with such modification, replace @page parameter for
      trace_btrfs_writepage_end_io_hook() with btrfs_inode.
      Although this removes page_index info, the existing start/len should be
      enough for most usage.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      38a39ac7
    • Qu Wenruo's avatar
      btrfs: make subpage metadata write path call its own endio functions · fa04c165
      Qu Wenruo authored
      For subpage metadata, we're reusing two functions for subpage metadata
      write:
      
      - end_bio_extent_buffer_writepage()
      - write_one_eb()
      
      But the truth is, for subpage we just call
      end_bio_subpage_eb_writepage() without using any bit in
      end_bio_extent_buffer_writepage().
      
      For write_one_eb(), it's pretty similar, but with a small part of code
      reused.
      
      There is really no need to pollute the existing code path if we're not
      really using most of them.
      
      So this patch will make the following changes to separate the subpage
      metadata write path from the regular write path:
      
      - Use end_bio_subpage_eb_writepage() directly as endio in
        write_one_subpage_eb()
      - Directly call write_one_subpage_eb() in submit_eb_subpage()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fa04c165
    • Qu Wenruo's avatar
      btrfs: refactor submit_extent_page() to make bio and its flag tracing easier · 390ed29b
      Qu Wenruo authored
      A lot of code inside extent_io.c needs both "struct bio
      **bio_ret" and "unsigned long prev_bio_flags", along with some
      parameters like "unsigned long bio_flags".
      
      Such strange parameters are here for bio assembly.
      
      For example, we have such inode page layout:
      
        0       4K      8K      12K
        |<-- Extent A-->|<- EB->|
      
      Then what we do is:
      
      - Page [0, 4K)
        *bio_ret = NULL
        So we allocate a new bio to bio_ret,
        Add page [0, 4K) to *bio_ret.
      
      - Page [4K, 8K)
        *bio_ret != NULL
        We found this page is continuous to *bio_ret,
        and if we're not at stripe boundary, we
        add page [4K, 8K) to *bio_ret.
      
      - Page [8K, 12K)
        *bio_ret != NULL
        But we found this page is not continuous, so
        we submit *bio_ret, then allocate a new bio,
        and add page [8K, 12K) to the new bio.
      
      This means we need to record both the bio and its bio_flags, but we
      record them manually using that strange parameter list, rather than
      encapsulating them into their own structure.
      
      So this patch will introduce a new structure, btrfs_bio_ctrl, to record
      both the bio, and its bio_flags.
      
      Also, in above case, for all pages added to the bio, we need to check if
      the new page crosses stripe boundary.  This check itself can be time
      consuming, and we don't really need to do that for each page.
      
      This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
      When a new bio is allocated, the stripe and ordered extent boundary is
      also calculated, so no matter how large the bio will be, we only
      calculate the boundaries once, to save some CPU time.
      
      The following functions/structures are affected:
      
      - struct extent_page_data
        Replace its bio pointer with structure btrfs_bio_ctrl (embedded
        structure, not pointer)
      
      - end_write_bio()
      - flush_write_bio()
        Just change how bio is fetched
      
      - btrfs_bio_add_page()
        Use pre-calculated boundaries instead of re-calculating them.
        And use @bio_ctrl to replace @bio and @prev_bio_flags.
      
      - calc_bio_boundaries()
        New function
      
      - submit_extent_page() callers
      - btrfs_do_readpage() callers
      - contiguous_readpages() callers
        Use @bio_ctrl to replace @bio and @prev_bio_flags, and change how the
        bio is grabbed.
      
      - btrfs_bio_fits_in_ordered_extent()
        Removed, as now the ordered extent size limit is done at bio
        allocation time, no need to check for each page range.
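
The bio assembly described in this message can be sketched with a toy control structure. This is an illustration only: the structure layout and function names are hypothetical, a "bio" is reduced to a disk byte range, only the stripe boundary is modelled (the ordered extent boundary is left out), and each added range is assumed not to cross a boundary by itself.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_bio_ctrl {
	bool has_bio;
	uint64_t bio_start;		/* disk bytenr of the first byte */
	uint64_t bio_len;
	unsigned long bio_flags;
	uint64_t boundary;		/* exclusive limit, computed once per bio */
	unsigned int nr_submitted;
};

/* Submit the bio under assembly, if there is one. */
static void toy_submit(struct toy_bio_ctrl *ctrl)
{
	if (!ctrl->has_bio)
		return;
	ctrl->nr_submitted++;
	ctrl->has_bio = false;
}

/* Add a range: merge if contiguous, same flags and inside the boundary. */
static void toy_add_range(struct toy_bio_ctrl *ctrl, uint64_t disk_bytenr,
			  uint64_t len, unsigned long flags,
			  uint64_t stripe_len)
{
	bool can_merge = ctrl->has_bio &&
			 ctrl->bio_flags == flags &&
			 ctrl->bio_start + ctrl->bio_len == disk_bytenr &&
			 disk_bytenr + len <= ctrl->boundary;

	if (!can_merge) {
		toy_submit(ctrl);
		ctrl->has_bio = true;
		ctrl->bio_start = disk_bytenr;
		ctrl->bio_len = 0;
		ctrl->bio_flags = flags;
		/* The only place where the boundary is calculated. */
		ctrl->boundary = (disk_bytenr / stripe_len + 1) * stripe_len;
	}
	ctrl->bio_len += len;
}
```

Adding two contiguous 4K ranges and then a non-contiguous one submits the first bio once and leaves the third range in a fresh bio, matching the three-page walkthrough above.
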
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      390ed29b
    • Qu Wenruo's avatar
      btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page · 1a0b5c4d
      Qu Wenruo authored
      Function btrfs_bio_fits_in_stripe() currently requires a bio with at
      least one page added.  Otherwise btrfs_get_chunk_map() will fail with
      -ENOENT.
      
      But in fact this requirement is not needed at all, as we can just pass
      sectorsize for btrfs_get_chunk_map().
      
      This tiny behavior change is important for later subpage refactoring on
      submit_extent_page().
      
      As for 64K page size, we can have a page range with pgoff=0 and size=64K.
      If the logical bytenr is just 16K before the stripe boundary, we have to
      split the page range into two bios.
      
      This means we must check the page range against the stripe boundary,
      even when adding the range to an empty bio.
      
      This tiny refactoring is for the incoming changes, but on its own,
      regular sectorsize == PAGE_SIZE is not affected anyway.
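
The 64K example above comes down to simple arithmetic, sketched here with a hypothetical helper. The function name and the 64K stripe length used in the usage note are assumptions for illustration.

```c
#include <stdint.h>

/*
 * How many bytes of [logical, logical + len) fit before the next stripe
 * boundary. If the result is smaller than len, the range has to be split
 * and the remainder goes into another bio.
 */
static uint64_t toy_bytes_before_boundary(uint64_t logical, uint64_t len,
					  uint64_t stripe_len)
{
	uint64_t to_boundary = stripe_len - (logical % stripe_len);

	return len < to_boundary ? len : to_boundary;
}
```

For a 64K range that starts 16K before a boundary, the first bio gets 16K and the remaining 48K needs a second bio, even though the first bio was empty when the range was added.
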
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1a0b5c4d
    • Qu Wenruo's avatar
      btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe() · 43c0d1a5
      Qu Wenruo authored
      The parameter @len is not really used in btrfs_bio_fits_in_stripe(),
      just remove it.
      
      Its last use got removed in 42034313 ("btrfs: let callers of
      btrfs_get_io_geometry pass the em"), before that btrfs_get_chunk_map
      utilized it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      43c0d1a5
    • Qu Wenruo's avatar
      btrfs: make free space cache size consistent across different PAGE_SIZE · 0044ae11
      Qu Wenruo authored
      Currently free space cache inode size is determined by two factors:
      
      - block group size
      - PAGE_SIZE
      
      This means, for the same sized block groups, with different PAGE_SIZE,
      it will result in different inode sizes.
      
      This will not be a good thing for subpage support, so change the
      requirement for PAGE_SIZE to sectorsize.
      
      Now for the same 4K sectorsize btrfs, it should result in the same inode
      size no matter what the PAGE_SIZE is.
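
The effect of switching the unit can be shown with a toy sizing function. The sizing rule below (16 units per 256M of block group, with a minimum of one step) and all names are illustrative assumptions, not taken from this message. The point is only that the result depends on the unit passed in.

```c
#include <stdint.h>

#define TOY_SZ_256M (256ULL * 1024 * 1024)

/*
 * Hypothetical sizing rule: 16 units for every 256M of block group.
 * unit is either PAGE_SIZE (old behaviour) or sectorsize (new behaviour).
 */
static uint64_t toy_cache_size(uint64_t block_group_len, uint32_t unit)
{
	uint64_t steps = block_group_len / TOY_SZ_256M;

	if (steps == 0)
		steps = 1;
	return steps * 16 * unit;
}
```

Passing the 4K sectorsize gives the same size on 4K and 64K page systems, while passing PAGE_SIZE makes the 64K page system use a 16 times larger inode for the same block group.
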
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0044ae11
    • Qu Wenruo's avatar
      btrfs: scrub: fix subpage repair error caused by hard coded PAGE_SIZE · 8df507cb
      Qu Wenruo authored
      [BUG]
      For the following file layout, scrub will not be able to repair both of
      these repairable errors, and in fact makes one corruption
      unrepairable:
      
                inode offset 0      4K     8K
      Mirror 1               |XXXXXX|      |
      Mirror 2               |      |XXXXXX|
      
      [CAUSE]
      The root cause is the hard coded PAGE_SIZE, which breaks scrub repair
      for subpage.
      
      For above case, when reading the first sector, we use PAGE_SIZE rather
      than sectorsize to read, so we read the full range [0, 64K).
      In fact, after 8K there may be no data at all, we can just get some
      garbage.
      
      Then when doing the repair, we also writeback a full page from mirror 2,
      this means, we will also writeback the corrupted data in mirror 2 back
      to mirror 1, leaving the range [4K, 8K) unrepairable.
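
The failure can be reproduced with a toy model that tracks one "bad" flag per sector. All names are hypothetical, and the model assumes 16 sectors per page (a 64K page with 4K sectors, as in the example above).

```c
#include <stdbool.h>

#define TOY_SECTORS_PER_PAGE 16u /* 64K page / 4K sector */

/*
 * Repair nr_sectors sectors of the target mirror, starting at sector
 * first, by copying them from the source mirror. A copied sector is
 * as good or as bad as the source.
 */
static void toy_repair(bool *target_bad, const bool *source_bad,
		       unsigned int first, unsigned int nr_sectors)
{
	unsigned int i;

	for (i = first; i < first + nr_sectors && i < TOY_SECTORS_PER_PAGE; i++)
		target_bad[i] = source_bad[i];
}

/* A sector is unrepairable once both mirrors hold a bad copy. */
static bool toy_unrepairable(const bool *mirror1_bad, const bool *mirror2_bad,
			     unsigned int sector)
{
	return mirror1_bad[sector] && mirror2_bad[sector];
}
```

Repairing sector 0 of mirror 1 with a page sized copy drags the bad sector 1 from mirror 2 along with it, while a sector sized copy leaves sector 1 of mirror 1 intact so that mirror 2 can be repaired from it afterwards.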
      
      [FIX]
      This patch will replace the following PAGE_SIZE uses with sectorsize:
      
      - scrub_print_warning_inode()
        Remove the min() and replace PAGE_SIZE with sectorsize.
        The min() makes no sense, as csum is done for the full sector with
        padding.
      
        This fixes a bug where subpage reports extra length like:
         checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
         physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)
      
        Where the error is only 1 sector.
      
      - scrub_handle_errored_block()
        Comments with PAGE|page involved, all changed to sector.
      
      - scrub_setup_recheck_block()
      - scrub_repair_page_from_good_copy()
      - scrub_add_page_to_wr_bio()
      - scrub_wr_submit()
      - scrub_add_page_to_rd_bio()
      - scrub_block_complete()
        Replace PAGE_SIZE with sectorsize.
        This solves several problems where we read/write extra range for
        subpage case.
      
      RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE
      usage, and is not really safe enough.
      Thus we will reject RAID56 for subpage in a later commit.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8df507cb