1. 23 Aug, 2021 40 commits
    • Qu Wenruo's avatar
      btrfs: allow read-write for 4K sectorsize on 64K page size systems · 95ea0486
      Qu Wenruo authored
      Since now we support data and metadata read-write for subpage, remove
      the RO requirement for subpage mount.
      
      There are some extra limitations though:
      
      - For now, subpage RW mount is still considered experimental
        Thus that mount warning will still be there.
      
      - No compression support
        There are still quite some PAGE_SIZE hard coded and quite some call
        sites use extent_clear_unlock_delalloc() to unlock locked_page.
        This will screw up subpage helpers.
      
        Now for subpage RW mount, no matter what mount option or inode attr is
        set, all writes will not be compressed.  Although reading compressed
        data has no problem.
      
      - No defrag for subpage case
        The defrag support for subpage case will come in later patches, which
        will also rework the defrag workflow.
      
      - No inline extent will be created
        This is mostly due to the fact that filemap_fdatawrite_range() will
        trigger more write than the range specified.
        In fallocate calls, this behavior can make us to writeback which can
        be inlined, before we enlarge the i_size.
      
        This is a very special corner case, and even current btrfs check won't
        report error on such inline extent + regular extent.
        But considering how much effort has been put to prevent such inline +
        regular, I'd prefer to cut off inline extent completely until we have
        a good solution.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      95ea0486
    • Qu Wenruo's avatar
      btrfs: subpage: fix relocation potentially overwriting last page data · 9d9ea1e6
      Qu Wenruo authored
      [BUG]
      When using the following script, btrfs will report data corruption after
      one data balance with subpage support:
      
        mkfs.btrfs -f -s 4k $dev
        mount $dev -o nospace_cache $mnt
        $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
        sync
        btrfs balance start -d $mnt
        btrfs scrub start -B $mnt
      
      Similar problem can be easily observed in btrfs/028 test case, there
      will be tons of balance failure with -EIO.
      
      [CAUSE]
      Above fsstress will result the following data extents layout in extent
      tree:
        item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
          refs 2 gen 7 flags DATA
          extent data backref root FS_TREE objectid 259 offset 1339392 count 1
          extent data backref root FS_TREE objectid 259 offset 647168 count 1
        item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
          block group used 102400 chunk_objectid 256 flags DATA
        item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
          refs 1 gen 7 flags DATA
          extent data backref root FS_TREE objectid 259 offset 729088 count 1
      
      Then when creating the data reloc inode, the data reloc inode will look
      like this:
      
      	0	32K	64K	96K 100K	104K
      	|<------ Extent A ----->|   |<- Ext B ->|
      
      Then when we first try to relocate extent A, we setup the data reloc
      inode with i_size 96K, then read both page [0, 64K) and page [64K, 128K).
      
      For page 64K, since the i_size is just 96K, we fill range [96K, 128K)
      with 0 and set it uptodate.
      
      Then when we come to extent B, we update i_size to 104K, then try to read
      page [64K, 128K).
      Then we find the page is already uptodate, so we skip the read.
      But range [96K, 128K) is filled with 0, not the real data.
      
      Then we writeback the data reloc inode to disk, with 0 filling range
      [96K, 128K), corrupting the content of extent B.
      
      The behavior is caused by the fact that we still do full page read for
      subpage case.
      
      The bug won't really happen for regular sectorsize, as one page only
      contains one sector.
      
      [FIX]
      This patch will fix the problem by invalidating range [i_size, PAGE_END]
      in prealloc_file_extent_cluster().
      
      So that if above example happens, when we preallocate the file extent
      for extent B, we will clear the uptodate bits for range [96K, 128K),
      allowing later relocate_one_page() to re-read the needed range.
      
      There is a special note for the invalidating part.
      
      Since we're not calling real btrfs_invalidatepage(), but just clearing
      the subpage and page uptodate bits, we can leave a page half dirty and
      half out of date.
      
      Reading such page can cause a deadlock, as we normally expect a dirty
      page to be fully uptodate.
      
      Thus here we flush and wait the data reloc inode before doing the hacked
      invalidating.  This won't cause extra overhead, as we're going to
      writeback the data later anyway.
      Reported-by: default avatarRitesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9d9ea1e6
    • Qu Wenruo's avatar
      btrfs: subpage: fix false alert when relocating partial preallocated data extents · e3c62324
      Qu Wenruo authored
      [BUG]
      When relocating partial preallocated data extents (part of the
      preallocated extent is written) for subpage, it can cause the following
      false alert and make the relocation to fail:
      
        BTRFS info (device dm-3): balance: start -d
        BTRFS info (device dm-3): relocating block group 13631488 flags data
        BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
        BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
        BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
        BTRFS info (device dm-3): balance: ended with status: -5
      
      The minimal script to reproduce looks like this:
      
        mkfs.btrfs -f -s 4k $dev
        mount $dev -o nospace_cache $mnt
        xfs_io -f -c "falloc 0 8k" $mnt/file
        xfs_io -f -c "pwrite 0 4k" $mnt/file
        btrfs balance start -d $mnt
      
      [CAUSE]
      Function btrfs_verify_data_csum() checks if the full range has
      EXTENT_NODATASUM bit for data reloc inode, if *all* bytes of the range
      have EXTENT_NODATASUM bit, then it skip the range.
      
      This works pretty well for regular sectorsize, as in that case
      btrfs_verify_data_csum() is called for each sector, thus no problem at
      all.
      
      But for subpage case, btrfs_verify_data_csum() is called on each bvec,
      which can contain several sectors, and since it checks *all* bytes for
      EXTENT_NODATASUM bit, if we have some range with csum, then we will
      continue checking all the sectors.
      
      For the preallocated sectors, it doesn't have any csum, thus obviously
      the csum won't match and cause the false alert.
      
      [FIX]
      Move the EXTENT_NODATASUM check into the main loop, so that we can check
      each sector for EXTENT_NODATASUM bit for subpage case.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3c62324
    • Qu Wenruo's avatar
      btrfs: subpage: fix a potential use-after-free in writeback helper · 7c11d0ae
      Qu Wenruo authored
      [BUG]
      There is a possible use-after-free bug when running generic/095.
      
       BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
       Faulting instruction address: 0xc000000000283654
       c000000000283078 do_raw_spin_unlock+0x88/0x230
       c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
       c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
       c0000000009e0458 end_bio_extent_writepage+0x158/0x270
       c000000000b6fd14 bio_endio+0x254/0x270
       c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
       c000000000b6fd14 bio_endio+0x254/0x270
       c000000000b781fc blk_update_request+0x46c/0x670
       c000000000b8b394 blk_mq_end_request+0x34/0x1d0
       c000000000d82d1c lo_complete_rq+0x11c/0x140
       c000000000b880a4 blk_complete_reqs+0x84/0xb0
       c0000000012b2ca4 __do_softirq+0x334/0x680
       c0000000001dd878 irq_exit+0x148/0x1d0
       c000000000016f4c do_IRQ+0x20c/0x240
       c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
      
      [CAUSE]
      There is very small race window like the following in generic/095.
      
      	Thread 1		|		Thread 2
      --------------------------------+------------------------------------
        end_bio_extent_writepage()	| btrfs_releasepage()
        |- spin_lock_irqsave()	| |
        |- end_page_writeback()	| |
        |				| |- if (PageWriteback() ||...)
        |				| |- clear_page_extent_mapped()
        |				|    |- kfree(subpage);
        |- spin_unlock_irqrestore().
      
      The race can also happen between writeback and btrfs_invalidatepage(),
      although that would be much harder as btrfs_invalidatepage() has much
      more work to do before the clear_page_extent_mapped() call.
      
      [FIX]
      Here we "wait" for the subapge spinlock to be released before we detach
      subpage structure.
      So this patch will introduce a new function, wait_subpage_spinlock(), to
      do the "wait" by acquiring the spinlock and release it.
      
      Since the caller has ensured the page is not dirty nor writeback, and
      page is already locked, the only way to hold the subpage spinlock is
      from endio function.
      Thus we only need to acquire the spinlock to wait for any existing
      holder.
      Reported-by: default avatarRitesh Harjani <riteshh@linux.ibm.com>
      Tested-by: default avatarRitesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7c11d0ae
    • Qu Wenruo's avatar
      btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage() · e0467866
      Qu Wenruo authored
      [BUG]
      When running generic/095, there is a high chance to crash with subpage
      data RW support:
      
       assertion failed: PagePrivate(page) && page->private
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ctree.h:3403!
       Internal error: Oops - BUG: 0 [#1] SMP
       CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17
       Hardware name: Khadas VIM3 (DT)
       Call trace:
        assertfail.constprop.0+0x28/0x2c [btrfs]
        btrfs_subpage_assert+0x80/0xa0 [btrfs]
        btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
        btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
        btrfs_dirty_pages+0x160/0x270 [btrfs]
        btrfs_buffered_write+0x444/0x630 [btrfs]
        btrfs_direct_write+0x1cc/0x2d0 [btrfs]
        btrfs_file_write_iter+0xc0/0x160 [btrfs]
        new_sync_write+0xe8/0x180
        vfs_write+0x1b4/0x210
        ksys_pwrite64+0x7c/0xc0
        __arm64_sys_pwrite64+0x24/0x30
        el0_svc_common.constprop.0+0x70/0x140
        do_el0_svc+0x28/0x90
        el0_svc+0x2c/0x54
        el0_sync_handler+0x1a8/0x1ac
        el0_sync+0x170/0x180
       Code: f0000160 913be042 913c4000 955444bc (d4210000)
       ---[ end trace 3fdd39f4cccedd68 ]---
      
      [CAUSE]
      Although prepare_pages() calls find_or_create_page(), which returns the
      page locked, but in later prepare_uptodate_page() calls, we may call
      btrfs_readpage() which will unlock the page before it returns.
      
      This leaves a window where btrfs_releasepage() can sneak in and release
      the page, clearing page->private and causing above ASSERT().
      
      [FIX]
      In prepare_uptodate_page(), we should not only check page->mapping, but
      also PagePrivate() to ensure we are still holding the correct page which
      has proper fs context setup.
      Reported-by: default avatarRitesh Harjani <riteshh@linux.ibm.com>
      Tested-by: default avatarRitesh Harjani <riteshh@linux.ibm.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e0467866
    • Qu Wenruo's avatar
      btrfs: subpage: reject raid56 filesystem and profile conversion · c8050b3b
      Qu Wenruo authored
      RAID56 is not only unsafe due to its write-hole problem, but also has
      tons of hardcoded PAGE_SIZE.
      
      Disable it for subpage support for now.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c8050b3b
    • Qu Wenruo's avatar
      btrfs: subpage: allow submit_extent_page() to do bio split · e0eefe07
      Qu Wenruo authored
      Current submit_extent_page() just checks if the current page range can
      be fitted into current bio, and if not, submit then re-add.
      
      But this behavior can't handle subpage case at all.
      
      For subpage case, the problem is in the page size, 64K, which is also
      the same size as stripe size.
      
      This means, if we can't fit a full 64K into a bio, due to stripe limit,
      then it won't fit into next bio without crossing stripe either.
      
      The proper way to handle it is:
      
      - Check how many bytes we can be put into current bio
      - Put as many bytes as possible into current bio first
      - Submit current bio
      - Create a new bio
      - Add the remaining bytes into the new bio
      
      Refactor submit_extent_page() so that it does the above iteration.
      
      The main loop inside submit_extent_page() will look like this:
      
      	cur = pg_offset;
      	while (cur < pg_offset + size) {
      		u32 offset = cur - pg_offset;
      		int added;
      		if (!bio_ctrl->bio) {
      			/* Allocate new bio if needed */
      		}
      
      		/* Add as many bytes into the bio */
      		added = btrfs_bio_add_page();
      
      		if (added < size - offset) {
      			/* The current bio is full, submit it */
      		}
      		cur += added;
      	}
      
      Also, since we're doing new bio allocation deep inside the main loop,
      extract that code into a new helper, alloc_new_bio().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e0eefe07
    • Qu Wenruo's avatar
      btrfs: subpage: disable inline extent creation · 7367253a
      Qu Wenruo authored
      [BUG]
      When running the following fsx command (extracted from generic/127) on
      subpage filesystem, it can create inline extent with regular extents:
      
        fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > /tmp/fsx
      
      The offending extent would look like:
      
        item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14
          index 2 namelen 4 name: file
        item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728
          generation 7 type 0 (inline)
          inline extent data size 707 ram_bytes 707 compression 0 (none)
        item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53
          generation 7 type 2 (prealloc)
          prealloc data disk byte 102346752 nr 4096
          prealloc data offset 0 nr 4096
      
      [CAUSE]
      For subpage filesystem, the writeback is triggered in page units, which
      means, even if we just want to writeback range [16K, 20K) for 64K page
      system, we will still try to writeback any dirty sector of range [0, 64K).
      
      This is never a problem if sectorsize == PAGE_SIZE, but for subpage,
      this can cause unexpected problems.
      
      For above test case, the last several operations from fsx are:
      
       9055 trunc      from 0x40000 to 0x2c3
       9057 falloc     from 0x164c to 0x19d2 (0x386 bytes)
      
      In operation 9055, we dirtied sector [0, 4096), then in falloc, we call
      btrfs_wait_ordered_range(inode, start=4096, len=4096), only expecting to
      writeback any dirty data in [4096, 8192), but nothing else.
      
      Unfortunately, in subpage case, above btrfs_wait_ordered_range() will
      trigger writeback of the range [0, 64K), which includes the data at
      [0, 4096).
      
      And since at the call site, we haven't yet increased i_size, which is
      still 707, this means cow_file_range() can insert an inline extent.
      
      Resulting above inline + regular extent.
      
      [WORKAROUND]
      I don't really have any good short-term solution yet, as this means all
      operations that would trigger writeback need to be reviewed for any
      i_size change.
      
      So here I choose to disable inline extent creation for subpage case as a
      workaround.  We have done tons of work just to avoid such extent, so I
      don't to create an exception just for subpage.
      
      This only affects inline extent creation, subpage has no problem reading
      existing inline extents at all.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7367253a
    • Qu Wenruo's avatar
      btrfs: subpage: fix writeback which does not have ordered extent · cc1d0d93
      Qu Wenruo authored
      [BUG]
      When running fsstress with subpage RW support, there are random
      BUG_ON()s triggered with the following trace:
      
       kernel BUG at fs/btrfs/file-item.c:667!
       Internal error: Oops - BUG: 0 [#1] SMP
       CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
       Hardware name: Radxa ROCK Pi 4B (DT)
       Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
       pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
       pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
       lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
       Call trace:
        btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
        btrfs_submit_bio_start+0x20/0x30 [btrfs]
        run_one_async_start+0x28/0x44 [btrfs]
        btrfs_work_helper+0x128/0x1b4 [btrfs]
        process_one_work+0x22c/0x430
        worker_thread+0x70/0x3a0
        kthread+0x13c/0x140
        ret_from_fork+0x10/0x30
      
      [CAUSE]
      Above BUG_ON() means there is some bio range which doesn't have ordered
      extent, which indeed is worth a BUG_ON().
      
      Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
      subpage dirty bitmap to record which range is dirty and should be
      written back.
      
      This means, if we submit bio for a subpage range, we do not only need to
      clear page dirty, but also need to clear subpage dirty bits.
      
      In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
      any range we submit a bio.
      
      But there is loophole, if we hit a range which is beyond i_size, we just
      call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
      then break out, without clearing the subpage dirty.
      
      This means, if we hit above branch, the subpage dirty bits are still
      there, if other range of the page get dirtied and we need to writeback
      that page again, we will submit bio for the old range, leaving a wild
      bio range which doesn't have ordered extent.
      
      [FIX]
      Fix it by always calling btrfs_page_clear_dirty() in
      __extent_writepage_io().
      
      Also to avoid such problem from happening again, add a new assert,
      btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
      dirty bits are cleared before exiting __extent_writepage_io().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cc1d0d93
    • Qu Wenruo's avatar
      btrfs: make relocate_one_page() handle subpage case · c2832898
      Qu Wenruo authored
      For subpage case, one page of data reloc inode can contain several file
      extents, like this:
      
      |<--- File extent A --->| FE B | FE C |<--- File extent D -->|
      		|<--------- Page --------->|
      
      We can no longer use PAGE_SIZE directly for various operations.
      
      This patch will relocate_one_page() to handle subpage case by:
      - Iterating through all extents of a cluster when marking pages
        When marking pages dirty and delalloc, we need to check the cluster
        extent boundary.
        Now we introduce a loop to go extent by extent of a page, until we
        either finished the last extent, or reach the page end.
      
        By this, regular sectorsize == PAGE_SIZE can still work as usual, since
        we will do that loop only once.
      
      - Iteration start from max(page_start, extent_start)
        Since we can have the following case:
      			| FE B | FE C |<--- File extent D -->|
      		|<--------- Page --------->|
        Thus we can't always start from page_start, but do a
        max(page_start, extent_start)
      
      - Iteration end when the cluster is exhausted
        Similar to previous case, the last file extent can end before the page
        end:
      |<--- File extent A --->| FE B | FE C |
      		|<--------- Page --------->|
        In this case, we need to manually exit the loop after we have finished
        the last extent of the cluster.
      
      - Reserve metadata space for each extent range
        Since now we can hit multiple ranges in one page, we should reserve
        metadata for each range, not simply PAGE_SIZE.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c2832898
    • Qu Wenruo's avatar
      btrfs: reloc: factor out relocation page read and dirty part · f47960f4
      Qu Wenruo authored
      In function relocate_file_extent_cluster(), we have a big loop for
      marking all involved page delalloc.
      
      That part is long enough to be contained in one function, so this patch
      will move that code chunk into a new function, relocate_one_page().
      
      This also provides enough space for later subpage work.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f47960f4
    • Qu Wenruo's avatar
      btrfs: rework lzo_decompress_bio() to make it subpage compatible · a6e66e6f
      Qu Wenruo authored
      For the initial subpage support, although we won't support compressed
      write, we still need to support compressed read.
      
      But for lzo_decompress_bio() it has several problems:
      
      - The abuse of PAGE_SIZE for boundary detection
        For subpage case, we should follow sectorsize to detect the padding
        zeros.
        Using PAGE_SIZE will cause subpage compress read to skip certain
        bytes, and causing read error.
      
      - Too many helper variables
        There are half a dozen helper variables, which is only making things
        harder to read
      
      This patch will rework lzo_decompress_bio() to make it work for subpage:
      
      - Use sectorsize to do boundary check, while still use PAGE_SIZE for
        page switching
        This allows us to have the same on-disk format for 4K sectorsize fs,
        while take advantage of larger page size.
      
      - Use two main cursors
        Only @cur_in and @cur_out is utilized as the main cursor.
        The helper variables will only be declared inside the loop, and only 2
        helper variables needed.
      
      - Introduce a helper function to copy compressed segment payload
        Introduce a new helper, copy_compressed_segment(), to copy a
        compressed segment to workspace buffer.
        This function will handle the page switching.
      
      Now the net result is, with all the excessive comments and new helper
      function, the refactored code is still smaller, and easier to read.
      
      For other decompression code, they have no special padding rule, thus no
      need to bother for initial subpage support, but will be refactored to
      the same style later.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a6e66e6f
    • Qu Wenruo's avatar
      btrfs: rework btrfs_decompress_buf2page() · 1c3dc173
      Qu Wenruo authored
      There are several bugs inside the function btrfs_decompress_buf2page()
      
      - @start_byte doesn't take bvec.bv_offset into consideration
        Thus it can't handle case where the target range is not page aligned.
      
      - Too many helper variables
        There are tons of helper variables, @buf_offset, @current_buf_start,
        @start_byte, @prev_start_byte, @working_bytes, @bytes.
        This hurts anyone who wants to read the function.
      
      - No obvious main cursor for the iteartion
        A new problem caused by previous problem.
      
      - Comments for parameter list makes no sense
        Like @buf_start is the offset to @buf, or offset inside the full
        decompressed extent? (Spoiler alert, the later case)
        And @total_out acts more like @buf_start + @size_of_buf.
      
        The worst is @disk_start.
        The real meaning of it is the file offset of the full decompressed
        extent.
      
      This patch will rework the whole function by:
      
      - Add a proper comment with ASCII art to explain the parameter list
      
      - Rework parameter list
        The old @buf_start is renamed to @decompressed, to show how many bytes
        are already decompressed inside the full decompressed extent.
        The old @total_out is replaced by @buf_len, which is the decompressed
        data size.
        For old @disk_start and @bio, just pass @compressed_bio in.
      
      - Use single main cursor
        The main cursor will be @cur_file_offset, to show what's the current
        file offset.
        Other helper variables will be declared inside the main loop, and only
        minimal amount of helper variables:
        * offset_inside_decompressed_buf:	The only real helper
        * copy_start_file_offset:		File offset we start memcpy
        * bvec_file_offset:			File offset of current bvec
      
      Even with all these extensive comments, the final function is still
      smaller than the original function, which is definitely a win.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1c3dc173
    • Qu Wenruo's avatar
      btrfs: grab correct extent map for subpage compressed extent read · 557023ea
      Qu Wenruo authored
      [BUG]
      When subpage compressed read write support is enabled, btrfs/038 always
      fails with EIO.
      
      A simplified script can easily trigger the problem:
      
        mkfs.btrfs -f -s 4k $dev
        mount $dev $mnt -o compress=lzo
      
        xfs_io -f -c "truncate 118811" $mnt/foo
        xfs_io -c "pwrite -S 0x0d -b 39987 92267 39987" $mnt/foo > /dev/null
      
        sync
        btrfs subvolume snapshot -r $mnt $mnt/mysnap1
      
        xfs_io -c "pwrite -S 0x3e -b 80000 200000 80000" $mnt/foo > /dev/null
        sync
      
        xfs_io -c "pwrite -S 0xdc -b 10000 250000 10000" $mnt/foo > /dev/null
        xfs_io -c "pwrite -S 0xff -b 10000 300000 10000" $mnt/foo > /dev/null
      
        sync
        btrfs subvolume snapshot -r $mnt $mnt/mysnap2
      
        cat $mnt/mysnap2/foo
        # Above cat will fail due to EIO
      
      [CAUSE]
      The problem is in btrfs_submit_compressed_read().
      
      When it tries to grab the extent map of the read range, it uses the
      following call:
      
      	em = lookup_extent_mapping(em_tree,
        				   page_offset(bio_first_page_all(bio)),
      				   fs_info->sectorsize);
      
      The problem is in the page_offset(bio_first_page_all(bio)) part.
      
      The offending inode has the following file extent layout
      
              item 10 key (257 EXTENT_DATA 131072) itemoff 15639 itemsize 53
                      generation 8 type 1 (regular)
                      extent data disk byte 13680640 nr 4096
                      extent data offset 0 nr 4096 ram 4096
                      extent compression 0 (none)
              item 11 key (257 EXTENT_DATA 135168) itemoff 15586 itemsize 53
                      generation 8 type 1 (regular)
                      extent data disk byte 0 nr 0
              item 12 key (257 EXTENT_DATA 196608) itemoff 15533 itemsize 53
                      generation 8 type 1 (regular)
                      extent data disk byte 13676544 nr 4096
                      extent data offset 0 nr 53248 ram 86016
                      extent compression 2 (lzo)
      
      And the bio passed in has the following parameters:
      
      page_offset(bio_first_page_all(bio))	= 131072
      bio_first_bvec_all(bio)->bv_offset	= 65536
      
      If we use page_offset(bio_first_page_all(bio) without adding bv_offset,
      we will get an extent map for file offset 131072, not 196608.
      
      This means we read uncompressed data from disk, and later decompression
      will definitely fail.
      
      [FIX]
      Take bv_offset into consideration when trying to grab an extent map.
      
      And add an ASSERT() to ensure we're really getting a compressed extent.
      
      Thankfully this won't affect anything but subpage, thus we only need to
      ensure this patch get merged before we enabled basic subpage support.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      557023ea
    • Qu Wenruo's avatar
      btrfs: disable compressed readahead for subpage · ca62e85d
      Qu Wenruo authored
      For current subpage support, we only support 64K page size with 4K
      sector size.
      
      This makes compressed readahead less effective, as maximum compressed
      extent size is only 128K, 2x the page size.
      
      On the other hand, the function add_ra_bio_pages() is still assuming
      sectorsize == PAGE_SIZE, and code change may affect 4K page size
      systems.
      
      So for now, let's disable subpage compressed readahead for now.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca62e85d
    • Qu Wenruo's avatar
      btrfs: subpage: check if there are compressed extents inside one page · 3670e645
      Qu Wenruo authored
      [BUG]
      When testing experimental subpage compressed write support, it hits a
      NULL pointer dereference inside read path:
      
       Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
       pc : __pi_memcmp+0x28/0x1ec
       lr : check_data_csum+0xd0/0x274 [btrfs]
       Call trace:
        __pi_memcmp+0x28/0x1ec
        btrfs_verify_data_csum+0xf4/0x244 [btrfs]
        end_bio_extent_readpage+0x1d0/0x6b0 [btrfs]
        bio_endio+0x15c/0x1dc
        end_workqueue_fn+0x44/0x64 [btrfs]
        btrfs_work_helper+0x74/0x250 [btrfs]
        process_one_work+0x1d4/0x47c
        worker_thread+0x180/0x400
        kthread+0x11c/0x120
        ret_from_fork+0x10/0x30
       Code: 54000261 d100044c d343fd8c f8408403 (f8408424)
       ---[ end trace 9e2c59f33ea40866 ]---
      
      [CAUSE]
      When reading two compressed extents inside the same page, like the
      following layout, we trigger above crash:
      
      	0	32K	64K
      	|-------|\\\\\\\|
      	     |	     \- Compressed extent (A)
      	     \--------- Compressed extent (B)
      
      For compressed read, we don't need to populate its io_bio->csum, as we
      rely on compressed_bio->csum to verify the compressed data, and then
      copy the decompressed to inode pages.
      
      Normally btrfs_verify_data_csum() skip such page by checking and
      clearing its PageChecked flag
      
      But since that flag is still for the full page, when endio for inode
      page range [0, 32K) gets executed, it clears PageChecked flag for the
      full page.
      
      Then when endio for inode page range [32K, 64K) gets executed, since the
      page no longer has PageChecked flag, it just continues checking, even
      though io_bio->csum is NULL.
      
      [FIX]
      Thankfully there are only two users of PageChecked bit:
      
      - Cow fixup
        Since subpage has its own way to trace page dirty (dirty_bitmap) and
        ordered bit (ordered_bitmap), it should never trigger cow fixup.
      
      - Compressed read
        We can distinguish such read by just checking io_bio->csum.
      
      So just check io_bio->csum before doing the verification to avoid such
      NULL pointer dereference.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3670e645
    • Qu Wenruo's avatar
      btrfs: reset this_bio_flag to avoid inheriting old flags · 4c37a793
      Qu Wenruo authored
      In btrfs_do_readpage(), we never reset @this_bio_flag after we hit a
      compressed extent.
      
      This is fine, as for PAGE_SIZE == sectorsize case, we can only have one
      sector for one page, thus @this_bio_flag will only be set at most once.
      
      But for subpage case, after hitting a compressed extent, @this_bio_flag
      will always have EXTENT_BIO_COMPRESSED bit, even we're reading a regular
      extent.
      
      This will lead to various read errors, and causing new ASSERT() in
      incoming subpage patches, which adds more strict check in
      btrfs_submit_compressed_read().
      
      Fix it by declaring @this_bio_flag inside the main loop and reset its
      value for each iteration.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c37a793
    • David Sterba's avatar
      btrfs: constify and cleanup variables in comparators · 214cc184
      David Sterba authored
      Comparators just read the data and thus get const parameters. This
      should be also preserved by the local variables, update all comparators
      passed to sort or bsearch.
      
      Cleanups:
      
      - unnecessary casts are dropped
      - btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern
        and 'inline' is dropped as the function address is taken
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      214cc184
    • David Sterba's avatar
      btrfs: simplify data stripe calculation helpers · d58ede8d
      David Sterba authored
      There are two helpers doing the same calculations based on nparity and
      ncopies. calc_data_stripes can be simplified into one expression, so far
      we don't have profile with both copies and parity, so there's no
      effective change. calc_stripe_length should reuse the helper and not
      repeat the same calculation.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d58ede8d
    • David Sterba's avatar
      btrfs: merge alloc_device helpers · fe4f46d4
      David Sterba authored
      The device allocation is split to two functions, but one just calls the
      other and they're very far in the file. Merge them together.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fe4f46d4
    • David Sterba's avatar
      btrfs: uninline btrfs_bg_flags_to_raid_index · 500a44c9
      David Sterba authored
      The helper does a simple translation from block group flags to index to
      the btrfs_raid_array table. There's no apparent reason to inline the
      function, the translation happens usually once per function and is not
      called in a loop.
      
      Making it a proper function saves quite some binary code (x86_64,
      release config):
      
         text    data     bss     dec     hex filename
      1164011   19253   14912 1198176  124860 pre/btrfs.ko
      1161559   19253   14912 1195724  123ecc post/btrfs.ko
      
      DELTA: -2451
      
      Also add the const attribute as there are no side effects, this could
      help compiler to optimize a few things without the function body.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      500a44c9
    • David Sterba's avatar
      btrfs: tree-checker: add missing stripe checks for raid1c3/4 profiles · 6c154ba4
      David Sterba authored
      The stripe checks for raid1c3/raid1c4 are missing in the sequence in
      btrfs_check_chunk_valid.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6c154ba4
    • David Sterba's avatar
      btrfs: tree-checker: use table values for stripe checks · 0ac6e06b
      David Sterba authored
      There are hardcoded values in several checks regarding chunks and stripe
      constraints. We have that defined in the raid table and ought to use it.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ac6e06b
    • David Sterba's avatar
      btrfs: make btrfs_next_leaf static inline · 809d6902
      David Sterba authored
      btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it
      to header to avoid the function call overhead.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      809d6902
    • David Sterba's avatar
      btrfs: remove uptodate parameter from btrfs_dec_test_first_ordered_pending · f41b6ba9
      David Sterba authored
      In commit e65f152e ("btrfs: refactor how we finish ordered extent io
      for endio functions") there was last caller not using 1 for the uptodate
      parameter. Now there's only one, passing 1, so we can remove it and
      simplify the code.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f41b6ba9
    • David Sterba's avatar
      btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_ordered · 25c1252a
      David Sterba authored
      The uptodate parameter should be bool, change the type.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      25c1252a
    • Qu Wenruo's avatar
      btrfs: remove unused start and end parameters from btrfs_run_delalloc_range() · a129ffb8
      Qu Wenruo authored
      Since commit d75855b4 ("btrfs: Remove
      extent_io_ops::writepage_start_hook") removes the writepage_start_hook()
      and adds btrfs_writepage_cow_fixup() function, there is no need to
      follow the old hook parameters.
      
      Remove the @start and @end hook, since currently the fixup check is full
      page check, it doesn't need @start and @end hook.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a129ffb8
    • Marcos Paulo de Souza's avatar
      btrfs: introduce btrfs_lookup_match_dir · a7d1c5dc
      Marcos Paulo de Souza authored
      btrfs_search_slot is called in multiple places in dir-item.c to search
      for a dir entry, and then calling btrfs_match_dir_name to return a
      btrfs_dir_item.
      
      In order to reduce the number of callers of btrfs_search_slot, create a
      common function that looks for the dir key, and if found call
      btrfs_match_dir_item_name.
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a7d1c5dc
    • Marcos Paulo de Souza's avatar
      btrfs: remove unneeded return variable in btrfs_lookup_file_extent · f8ee80de
      Marcos Paulo de Souza authored
      We can return from btrfs_search_slot directly which also shows that it
      follows the same return value convention.
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f8ee80de
    • Marcos Paulo de Souza's avatar
      btrfs: use btrfs_next_leaf instead of btrfs_next_item when slots > nritems · ad9a9378
      Marcos Paulo de Souza authored
      After calling btrfs_search_slot is a common practice to check if the
      slot found isn't bigger than number of slots in the current leaf, and if
      so, search for the same key in the next leaf by calling btrfs_next_leaf,
      which calls btrfs_next_old_leaf to do the job.
      
      Calling btrfs_next_item in the same situation would end up in the same
      code flow, since
      
      * btrfs_next_item
        * btrfs_next_old_item
          * if slot >= nritems(curr_leaf)
            btrfs_next_old_leaf
      
      Change btrfs_verify_dev_extents and calculate_emulated_zone_size
      functions to use btrfs_next_leaf in the same situation.
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad9a9378
    • Filipe Manana's avatar
      btrfs: remove ignore_offset argument from btrfs_find_all_roots() · c7bcbb21
      Filipe Manana authored
      Currently all the callers of btrfs_find_all_roots() pass a value of false
      for its ignore_offset argument. This makes the argument pointless and we
      can remove it and make btrfs_find_all_roots() always pass false as the
      ignore_offset argument for btrfs_find_all_roots_safe(). So just do that.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c7bcbb21
    • Filipe Manana's avatar
      btrfs: avoid unnecessary lock and leaf splits when updating inode in the log · 2ac691d8
      Filipe Manana authored
      During a fast fsync, if we have already fsynced the file before and in the
      current transaction, we can make the inode item update more efficient and
      avoid acquiring a write lock on the leaf's parent.
      
      To update the inode item we are always using btrfs_insert_empty_item() to
      get a path pointing to the inode item, which calls btrfs_search_slot()
      with an "ins_len" argument of 'sizeof(struct btrfs_inode_item) +
      sizeof(struct btrfs_item)', and that always results in the search taking
      a write lock on the level 1 node that is the parent of the leaf that
      contains the inode item. This adds unnecessary lock contention on log
      trees when we have multiple fsyncs in parallel against inodes in the same
      subvolume, which has a very significant impact due to the fact that log
      trees are short lived and their height very rarely goes beyond level 2.
      
      Also, by using btrfs_insert_empty_item() when we need to update the inode
      item, we also end up splitting the leaf of the existing inode item when
      the leaf has an amount of free space smaller than the size of an inode
      item.
      
      Improve this by using btrfs_seach_slot(), with a 0 "ins_len" argument,
      when we know the inode item already exists in the log. This avoids these
      two inefficiencies.
      
      The following script, using fio, was used to perform the tests:
      
        $ cat fio-test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 4 ]; then
          echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
          exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=randwrite
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        echo
        echo "Using config:"
        echo
        cat /tmp/fio-job.ini
        echo
        echo "mount options: $MOUNT_OPTIONS"
        echo
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        fio /tmp/fio-job.ini
        umount $MNT
      
      The tests were done on a physical machine, with 12 cores, 64G of RAM,
      using a NVMEe device and using a non-debug kernel config (the default one
      from Debian). The summary line from fio is provided below for each test
      run.
      
      With 8 jobs, file size 256M, fsync frequency of 4 and a block size of 4K:
      
      Before: WRITE: bw=28.3MiB/s (29.7MB/s), 28.3MiB/s-28.3MiB/s (29.7MB/s-29.7MB/s), io=2048MiB (2147MB), run=72297-72297msec
      After:  WRITE: bw=28.7MiB/s (30.1MB/s), 28.7MiB/s-28.7MiB/s (30.1MB/s-30.1MB/s), io=2048MiB (2147MB), run=71411-71411msec
      
      +1.4% throughput, -1.2% runtime
      
      With 16 jobs, file size 256M, fsync frequency of 4 and a block size of 4K:
      
      Before: WRITE: bw=40.0MiB/s (42.0MB/s), 40.0MiB/s-40.0MiB/s (42.0MB/s-42.0MB/s), io=4096MiB (4295MB), run=99980-99980msec
      After:  WRITE: bw=40.9MiB/s (42.9MB/s), 40.9MiB/s-40.9MiB/s (42.9MB/s-42.9MB/s), io=4096MiB (4295MB), run=97933-97933msec
      
      +2.2% throughput, -2.1% runtime
      
      The changes are small but it's possible to be better on faster hardware as
      in the test machine used disk utilization was pretty much 100% during the
      whole time the tests were running (observed with 'iostat -xz 1').
      
      The tests also included the previous patch with the subject of:
      "btrfs: avoid unnecessary log mutex contention when syncing log".
      So they compared a branch without that patch and without this patch versus
      a branch with these two patches applied.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2ac691d8
    • Filipe Manana's avatar
      btrfs: remove unnecessary list head initialization when syncing log · e68107e5
      Filipe Manana authored
      One of the last steps of syncing the log is to remove all log contexts
      from the root's list of contexts, done at btrfs_remove_all_log_ctxs().
      There we iterate over all the contexts in the list and delete each one
      from the list, and after that we call INIT_LIST_HEAD() on the list. That
      is unnecessary since at that point the list is empty.
      
      So just remove the INIT_LIST_HEAD() call. It's not needed, increases code
      size (bloat-o-meter reported a delta of -122 for btrfs_sync_log() after
      this change) and increases two critical sections delimited by log mutexes.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e68107e5
    • Filipe Manana's avatar
      btrfs: avoid unnecessary log mutex contention when syncing log · e1a6d264
      Filipe Manana authored
      When syncing the log we acquire the root's log mutex just to update the
      root's last_log_commit. This is unnecessary because:
      
      1) At this point there can only be one task updating this value, which is
         the task committing the current log transaction. Any task that enters
         btrfs_sync_log() has to wait for the previous log transaction to commit
         and wait for the current log transaction to commit if someone else
         already started it (in this case it never reaches to the point of
         updating last_log_commit, as that is done by the committing task);
      
      2) All readers of the root's last_log_commit don't acquire the root's
         log mutex. This is to avoid blocking the readers, potentially for too
         long and because getting a stale value of last_log_commit does not
         cause any functional problem, in the worst case getting a stale value
         results in logging an inode unnecessarily. Plus it's actually very
         rare to get a stale value that results in unnecessarily logging the
         inode.
      
      So in order to avoid unnecessary contention on the root's log mutex,
      which is used for several different purposes, like starting/joining a
      log transaction and starting writeback of a log transaction, stop
      acquiring the log mutex for updating the root's last_log_commit.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e1a6d264
    • Filipe Manana's avatar
      btrfs: remove racy and unnecessary inode transaction update when using no-holes · cceaa89f
      Filipe Manana authored
      When using the NO_HOLES feature and expanding the size of an inode, we
      update the inode's last_trans, last_sub_trans and last_log_commit fields
      at maybe_insert_hole() so that a fsync does know that the inode needs to
      be logged (by making sure that btrfs_inode_in_log() returns false). This
      happens for expanding truncate operations, buffered writes, direct IO
      writes and when cloning extents to an offset greater than the inode's
      i_size.
      
      However the way we do it is racy, because in between setting the inode's
      last_sub_trans and last_log_commit fields, the log transaction ID that was
      assigned to last_sub_trans might be committed before we read the root's
      last_log_commit and assign that value to last_log_commit. If that happens
      it would make a future call to btrfs_inode_in_log() return true. This is
      a race that should be extremely unlikely to be hit in practice, and it is
      the same that was described by commit bc0939fc ("btrfs: fix race
      between marking inode needs to be logged and log syncing").
      
      The fix would simply be to set last_log_commit to the value we assigned
      to last_sub_trans minus 1, like it was done in that commit. However
      updating these two fields plus the last_trans field is pointless here
      because all the callers of btrfs_cont_expand() (which is the only
      caller of maybe_insert_hole()) always call btrfs_set_inode_last_trans()
      or btrfs_update_inode() after calling btrfs_cont_expand(). Calling either
      btrfs_set_inode_last_trans() or btrfs_update_inode() guarantees that the
      next fsync will log the inode, as it makes btrfs_inode_in_log() return
      false.
      
      So just remove the code that explicitly sets the inode's last_trans,
      last_sub_trans and last_log_commit fields.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cceaa89f
    • Filipe Manana's avatar
      btrfs: stop doing GFP_KERNEL memory allocations in the ref verify tool · 5a656c36
      Filipe Manana authored
      In commit 351cbf6e ("btrfs: use nofs allocations for running delayed
      items") we wrapped all btree updates when running delayed items with
      memalloc_nofs_save() and memalloc_nofs_restore(), due to a lock inversion
      detected by lockdep involving reclaim and the mutex of delayed nodes.
      
      The problem is because the ref verify tool does some memory allocations
      with GFP_KERNEL, which can trigger reclaim and reclaim can trigger inode
      eviction, which requires locking the mutex of an inode's delayed node.
      On the other hand the ref verify tool is called when allocating metadata
      extents as part of operations that modify a btree, which is a problem when
      running delayed nodes, where we do btree updates while holding the mutex
      of a delayed node. This is what caused the lockdep warning.
      
      Instead of wrapping every btree update when running delayed nodes, change
      the ref verify tool to never do GFP_KERNEL allocations, because:
      
      1) We get less repeated code, which at the moment does not even have a
         comment mentioning why we need to setup the NOFS context, which is a
         recommended good practice as mentioned at
         Documentation/core-api/gfp_mask-from-fs-io.rst
      
      2) The ref verify tool is something meant only for debugging and not
         something that should be enabled on non-debug / non-development
         kernels;
      
      3) We may have yet more places outside delayed-inode.c where we have
         similar problem: doing btree updates while holding some lock and
         then having the GFP_KERNEL memory allocations, from the ref verify
         tool, trigger reclaim and trying again to acquire the same lock
         through the reclaim path.
         Or we could get more such cases in the future, therefore this change
         prevents getting into similar cases when using the ref verify tool.
      
      Curiously most of the memory allocations done by the ref verify tool
      were already using GFP_NOFS, except a few ones for no apparent reason.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5a656c36
    • Filipe Manana's avatar
      btrfs: improve the batch insertion of delayed items · 506650dc
      Filipe Manana authored
      When we insert the delayed items of an inode, which corresponds to the
      directory index keys for a directory (key type BTRFS_DIR_INDEX_KEY), we
      do the following:
      
      1) Pick the first delayed item from the rbtree and insert it into the
         fs/subvolume btree, using btrfs_insert_empty_item() for that;
      
      2) Without releasing the path returned by btrfs_insert_empty_item(),
         keep collecting as many consecutive delayed items from the rbtree
         as possible, as long as each one's BTRFS_DIR_INDEX_KEY key is the
         immediate successor of the previously picked item and as long as
         they fit in the available space of the leaf the path points to;
      
      3) Then insert all the collected items into the leaf;
      
      4) Release the reserve metadata space for each collected item and
         release each item (implies deleting from the rbtree);
      
      5) Unlock the path.
      
      While this is much better than inserting items one by one, it can be
      improved in a few aspects:
      
      1) Instead of adding items based on the remaining free space of the
         leaf, collect as many items that can fit in a leaf and bulk insert
         them. This results in less and larger batches, reducing the total
         amount of time to insert the delayed items. For example when adding
         100K files to a directory, we ended up creating 1658 batches with
         very variable sizes ranging from 1 item to 118 items, on a filesystem
         with a node/leaf size of 16K. After this change, we end up with 839
         batches, with the vast majority of them having exactly 120 items;
      
      2) We do the search for more items to batch, by iterating the rbtree,
         while holding a write lock on the leaf;
      
      3) While still holding the leaf locked, we are releasing the reserved
         metadata for each item and then deleting each item, keeping a write
         lock on the leaf for longer than necessary. Releasing the delayed items
         one by one can take a significant amount of time, because deleting
         them from the rbtree can often be a bit slow when the deletion results
         in rebalancing the rbtree.
      
      So change this so that we try to create larger batches, with a total
      item size up to the maximum a leaf can support, and by unlocking the leaf
      immediately after inserting the items, releasing the reserved metadata
      space of each item and releasing each item without holding the write lock
      on the leaf.
      
      The following script that runs fs_mark was used to test this change:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-m single -d single"
        FILES=1000000
        THREADS=16
        FILE_SIZE=0
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t 16"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
      It was run on machine with 12 cores, 64G of ram, using a NVMe device and
      using a non-debug kernel config (Debian's default config).
      
      Results before this change:
      
      FSUse%        Count         Size    Files/sec         App Overhead
           1     16000000            0      76182.1             72223046
           3     32000000            0      62746.9             80776528
           5     48000000            0      77029.0             93022381
           6     64000000            0      73691.6             95251075
           8     80000000            0      66288.0             85089634
      
      Results after this change:
      
      FSUse%        Count         Size    Files/sec         App Overhead
           1     16000000            0      79049.5 (+3.7%)     69700824
           3     32000000            0      65248.9 (+3.9%)     80583693
           5     48000000            0      77991.4 (+1.2%)     90040908
           6     64000000            0      75096.8 (+1.9%)     89862241
           8     80000000            0      66926.8 (+1.0%)     84429169
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      506650dc
    • Qu Wenruo's avatar
      btrfs: rescue: allow ibadroots to skip bad extent tree when reading block group items · 2b29726c
      Qu Wenruo authored
      When extent tree gets corrupted, normally it's not extent tree root, but
      one toasted tree leaf/node.
      
      In that case, rescue=ibadroots mount option won't help as it can only
      handle the extent tree root corruption.
      
      This patch will enhance the behavior by:
      
      - Allow fill_dummy_bgs() to ignore -EEXIST error
      
        This means we may have some block group items read from disk, but
        then hit some error halfway.
      
      - Fallback to fill_dummy_bgs() if any error gets hit in
        btrfs_read_block_groups()
      
        Of course, this still needs rescue=ibadroots mount option.
      
      With that, rescue=ibadroots can handle extent tree corruption more
      gracefully and allow a better recover chance.
      Reported-by: default avatarZhenyu Wu <wuzy001@gmail.com>
      Link: https://www.spinics.net/lists/linux-btrfs/msg114424.htmlReviewed-by: default avatarSu Yue <l@damenly.su>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2b29726c
    • Marcos Paulo de Souza's avatar
      btrfs: pass NULL as trans to btrfs_search_slot if we only want to search · 6534c0c9
      Marcos Paulo de Souza authored
      Using a transaction in btrfs_search_slot is only useful when we are
      searching to add or modify the tree. When the function is used for
      searching, insert length and mod arguments are 0, there is no need to
      use a transaction.
      
      No functional changes, changing for consistency.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarMarcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6534c0c9
    • Filipe Manana's avatar
      btrfs: continue readahead of siblings even if target node is in memory · 069a2e37
      Filipe Manana authored
      At reada_for_search(), when attempting to readahead a node or leaf's
      siblings, we skip the readahead of the siblings if the node/leaf is
      already in memory. That is probably fine for the READA_FORWARD and
      READA_BACK readahead types, as they are used on contexts where we
      end up reading some consecutive leaves, but usually not the whole btree.
      
      However for a READA_FORWARD_ALWAYS mode, currently only used for full
      send operations, it does not make sense to skip the readahead if the
      target node or leaf is already loaded in memory, since we know the caller
      is visiting every node and leaf of the btree in ascending order.
      
      So change the behaviour to not skip the readahead when the target node is
      already in memory and the readahead mode is READA_FORWARD_ALWAYS.
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk, with 32GiB of RAM and
      using a non-debug kernel config (Debian's default config).
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        file_count=2000000
        add_files $file_count 0 4
      
        echo
        echo "Creating snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
      The duration of the full send operations, in seconds, were the following:
      
      Before this change:  85 seconds
      After this change:   76 seconds (-11.2%)
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      069a2e37