1. 25 Jul, 2022 21 commits
    • btrfs: free the path earlier when creating a new inode · 814e7718
      Filipe Manana authored
      When creating an inode, through btrfs_create_new_inode(), we release the
      path we allocated before once we don't need it anymore. But we keep it
      allocated until we return from that function, which is wasteful because
      after we release the path we do several things that can allocate yet
      another path: inheriting properties, setting the xattrs used by ACLs and
      security modules, adding an orphan item (O_TMPFILE case) or adding a
      dir item (for the non-O_TMPFILE case).
      
      So instead of releasing the path once we don't need it anymore, free it
      instead. This way we avoid having two paths allocated until we return
      from btrfs_create_new_inode().
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: balance btree dirty pages and delayed items after a rename · ca6dee6b
      Filipe Manana authored
      A rename operation modifies a subvolume's btree, to remove the old dir
      item, add the new dir item, remove an inode ref and add a new inode ref.
      It can also create the delayed inode for the inodes involved in the
      operation, and it creates two delayed dir index items, one to delete
      the old name and another one to add the new name.
      
      However we are neither balancing the btree dirty pages nor the delayed
      items after a rename, which can result in accumulation of too many
      btree dirty pages and delayed items, especially if a task is doing a
      series of rename operations (for example it can happen for package
      installations/upgrades through the zypper tool).
      
      So just call btrfs_btree_balance_dirty() after a rename, just like we
      do for every other system call that results in modifying a btree and
      adding delayed items.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add trace event for submitted RAID56 bio · b8bea09a
      Qu Wenruo authored
      Add tracepoints for better insight into how RAID56 data is submitted.
      
      The output looks like this: (trace event header and UUID skipped)
      
         raid56_read_partial: full_stripe=389152768 devid=3 type=DATA1 offset=32768 opf=0x0 physical=323059712 len=32768
         raid56_read_partial: full_stripe=389152768 devid=1 type=DATA2 offset=0 opf=0x0 physical=67174400 len=65536
         raid56_write_stripe: full_stripe=389152768 devid=3 type=DATA1 offset=0 opf=0x1 physical=323026944 len=32768
         raid56_write_stripe: full_stripe=389152768 devid=2 type=PQ1 offset=0 opf=0x1 physical=323026944 len=32768
      
      The above debug output is from a 32K data write into an empty RAID56
      data chunk.
      
      Some explanation on the event output:
      
        full_stripe:	the logical bytenr of the full stripe
        devid:	btrfs devid
        type:		raid stripe type.
               	DATA1:	the first data stripe
               	DATA2:	the second data stripe
               	PQ1:	the P stripe
               	PQ2:	the Q stripe
        offset:	the offset inside the stripe.
        opf:		the bio op type
        physical:	the physical offset the bio is for
        len:		the length of the bio
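
The field list above can be illustrated with a small userspace sketch that formats one event in the same layout as the sample output. The struct and function names here are hypothetical and are not the kernel's trace event definitions; only the field names and the sample values come from the text above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace mirror of the fields shown in the trace output. */
struct raid56_bio_event {
	uint64_t full_stripe;	/* logical bytenr of the full stripe */
	uint64_t devid;		/* btrfs devid */
	const char *type;	/* "DATA1", "DATA2", "PQ1" or "PQ2" */
	uint32_t offset;	/* offset inside the stripe */
	unsigned int opf;	/* bio op type */
	uint64_t physical;	/* physical offset the bio is for */
	uint32_t len;		/* length of the bio */
};

/* Format one event the way the sample output prints it (header skipped). */
static int format_raid56_event(char *buf, size_t size, const char *name,
			       const struct raid56_bio_event *ev)
{
	assert(buf && name && ev);
	return snprintf(buf, size,
			"%s: full_stripe=%llu devid=%llu type=%s offset=%u "
			"opf=0x%x physical=%llu len=%u",
			name, (unsigned long long)ev->full_stripe,
			(unsigned long long)ev->devid, ev->type, ev->offset,
			ev->opf, (unsigned long long)ev->physical, ev->len);
}
```

Filling the struct with the values of the first sample line and passing "raid56_read_partial" as the name reproduces that line.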
      
      The first two lines are from partial RMW read, which is reading the
      remaining data stripes from disks.
      
      The last two lines are for the full stripe RMW write, which writes the
      two 32K stripes involved (one for the DATA1 stripe, one for the P stripe).
      The stripe for DATA2 doesn't need to be written.
      
      There are 5 types of trace events:
      
      - raid56_read_partial
        Read remaining data for regular read/write path.
      
      - raid56_write_stripe
        Write the modified stripes for regular read/write path.
      
      - raid56_scrub_read_recover
        Read remaining data for scrub recovery path.
      
      - raid56_scrub_write_stripe
        Write the modified stripes for scrub path.
      
      - raid56_scrub_read
        Read remaining data for scrub path.
      
      Also, since the trace events are included in super.c, we have to export
      the needed structure definitions to 'raid56.h' and include the header in
      super.c, or we're unable to access those members.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update stripe_sectors::uptodate in steal_rbio · 4d100466
      Qu Wenruo authored
      [BUG]
      With added debugging, it turns out the following write sequence causes
      an unnecessary extra read:
      
        # xfs_io -f -s -c "pwrite -b 32k 0 32k" -c "pwrite -b 32k 32k 32k" \
      		 -c "pwrite -b 32k 64k 32k" -c "pwrite -b 32k 96k 32k" \
      		 $mnt/file
      
      The debug message looks like this (btrfs header skipped):
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        ^^^^
         Still a partial read, even though 389152768 is already cached by the
         first write.
      
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=0 physical=22020096 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        ^^^^
         Still partial read for 298844160.
      
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      This means every 32K write, even within the same full stripe, still
      triggers a read of previously cached data.
      
      This would cause extra RAID56 IO, making the btrfs raid56 cache useless.
      
      [CAUSE]
      Commit d4e28d9b ("btrfs: raid56: make steal_rbio() subpage
      compatible") tries to make steal_rbio() subpage compatible, but during
      that conversion, there is one thing missing.
      
      We no longer rely on PageUptodate(rbio->stripe_pages[i]), but on
      rbio->stripe_sectors[i].uptodate to determine if a sector is uptodate.

      Previously, switching the page pointer was all that was needed, as the
      PageUptodate flag stays bound to that page.
      
      But now we have to manually mark the involved sectors uptodate, or later
      raid56_rmw_stripe() will find the stolen sector is not uptodate, and
      assemble the read bio for it, wasting IO.
      
      [FIX]
      We can easily fix the bug by also updating
      rbio->stripe_sectors[].uptodate in steal_rbio().
      
      With this fixed, now the same write pattern no longer leads to the same
      unnecessary read:
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        ^^^ No more partial read, directly into the write path.
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
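
The cause and fix described above can be modelled in userspace. The following is only a simplified sketch, not the kernel's steal_rbio(): the struct layout, the page count and the one-sector-per-page assumption are hypothetical. It shows why moving a page pointer is no longer enough once the uptodate state lives in a per-sector array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES		4	/* hypothetical, for illustration */
#define SECTORS_PER_PAGE	1	/* assumes 4K pages and 4K sectors */

struct sector_model {
	bool uptodate;
};

struct rbio_model {
	void *stripe_pages[NR_PAGES];
	struct sector_model stripe_sectors[NR_PAGES * SECTORS_PER_PAGE];
};

/* Move cached, uptodate pages from @src to @dest. */
static void steal_pages(struct rbio_model *src, struct rbio_model *dest)
{
	assert(src && dest);
	for (int i = 0; i < NR_PAGES; i++) {
		if (!src->stripe_pages[i] || !src->stripe_sectors[i].uptodate)
			continue;
		dest->stripe_pages[i] = src->stripe_pages[i];
		src->stripe_pages[i] = NULL;
		/* The step the fix adds: the flag no longer travels with the page. */
		dest->stripe_sectors[i].uptodate = true;
	}
}

/* Count the sectors a later RMW would still have to read from disk. */
static int sectors_needing_read(const struct rbio_model *rbio)
{
	int n = 0;

	for (int i = 0; i < NR_PAGES * SECTORS_PER_PAGE; i++)
		if (!rbio->stripe_sectors[i].uptodate)
			n++;
	return n;
}
```

Without the marked line, sectors_needing_read() on the destination would still report every sector, which corresponds to the repeated "partial rmw" reads in the first log.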
      
      Fixes: d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant calls to flush_dcache_page · 21a8935e
      David Sterba authored
      Both memzero_page and memcpy_to_page already call flush_dcache_page so
      we can remove the calls from btrfs code.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: only write the sectors in the vertical stripe which has data stripes · bd8f7e62
      Qu Wenruo authored
      If we have only an 8K partial write at the beginning of a full RAID56
      stripe, we will write the following contents:
      
                          0  8K           32K             64K
      Disk 1  (data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XXXXXXXXXXXXXXX|XXXXXXXXXXXXXXX|
      
      |X| means the sector will be written back to disk.
      
      Note that although we won't write any sectors from disk 2, we will
      still write the full 64KiB of parity to disk.
      
      This behavior is fine for now, but not for the future (especially for
      RAID56J, as we waste quite some space to journal the unused parity
      stripes).
      
      So here we also utilize btrfs_raid_bio::dbitmap: any time we queue a
      higher level bio into an rbio, we update rbio::dbitmap to indicate which
      vertical stripes we need to write back.
      
      And at finish_rmw(), we also check dbitmap to see if we need to write
      any sector in the vertical stripe.
      
      So after the patch, the above example will only lead to the following
      writeback pattern:
      
                          0  8K           32K             64K
      Disk 1  (data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XX|            |               |
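
The bookkeeping described above can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the kernel code: the 4K sector size is assumed, the function names are hypothetical, and only the idea of one bit per vertical stripe is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

#define SECTORSIZE	4096u		/* assumed sector size */
#define STRIPE_LEN	65536u		/* fixed 64KiB stripe length */
#define STRIPE_NSECTORS	(STRIPE_LEN / SECTORSIZE)

/*
 * Record which vertical stripes a write touches. @offset is the byte offset
 * of the write inside the data of the full stripe, @len its length in bytes.
 * Returns the updated bitmap.
 */
static unsigned long mark_dbitmap(unsigned long dbitmap, uint32_t offset,
				  uint32_t len)
{
	assert(offset % SECTORSIZE == 0 && len % SECTORSIZE == 0);
	for (uint32_t cur = offset; cur < offset + len; cur += SECTORSIZE)
		dbitmap |= 1UL << ((cur % STRIPE_LEN) / SECTORSIZE);
	return dbitmap;
}

/* Parity bytes to write back: one sector per vertical stripe that is set. */
static uint32_t parity_bytes_to_write(unsigned long dbitmap)
{
	uint32_t bytes = 0;

	for (unsigned int i = 0; i < STRIPE_NSECTORS; i++)
		if (dbitmap & (1UL << i))
			bytes += SECTORSIZE;
	return bytes;
}
```

For the 8K write at offset 0 from the example, the bitmap has its two lowest bits set and only 8K of parity is written, matching the second diagram.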
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use integrated bitmaps for scrub_parity::dbitmap and ebitmap · 381b9b4c
      Qu Wenruo authored
      Previously we used "unsigned long *" for those two bitmaps.
      
      But since we only support fixed stripe length (64KiB, already checked in
      tree-checker), "unsigned long *" is really a waste of memory, while we
      can just use "unsigned long".
      
      This saves us 8 bytes in total for scrub_parity.
      
      To be extra safe, add an ASSERT() making sure the calculated @nsectors
      is always smaller than BITS_PER_LONG.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use integrated bitmaps for btrfs_raid_bio::dbitmap and finish_pbitmap · c67c68eb
      Qu Wenruo authored
      Previously we used "unsigned long *" for those two bitmaps.
      
      But since we only support fixed stripe length (64KiB, already checked in
      tree-checker), "unsigned long *" is really a waste of memory, while we
      can just use "unsigned long".
      
      This saves us 8 bytes in total for btrfs_raid_bio.
      
      To be extra safe, add an ASSERT() making sure calculated
      @stripe_nsectors is always smaller than BITS_PER_LONG.
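
As a rough userspace illustration of the idea, a bitmap that never needs more bits than one machine word can be embedded directly in the structure, with an assertion guarding the size assumption. The struct and helper names below are hypothetical, and the check uses "<=" as this sketch's own choice, since a word can hold exactly BITS_PER_LONG bits.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical structure with the bitmap embedded, no separate allocation. */
struct rbio_bitmaps {
	unsigned long dbitmap;
	unsigned int stripe_nsectors;
};

static void bitmaps_init(struct rbio_bitmaps *b, unsigned int stripe_len,
			 unsigned int sectorsize)
{
	b->stripe_nsectors = stripe_len / sectorsize;
	/* A single word must be able to hold one bit per sector. */
	assert(b->stripe_nsectors <= BITS_PER_LONG);
	b->dbitmap = 0;
}

static void bitmaps_set(struct rbio_bitmaps *b, unsigned int nr)
{
	assert(nr < b->stripe_nsectors);
	b->dbitmap |= 1UL << nr;
}

static bool bitmaps_test(const struct rbio_bitmaps *b, unsigned int nr)
{
	assert(nr < b->stripe_nsectors);
	return (b->dbitmap >> nr) & 1UL;
}
```

With a 64KiB stripe and 4K sectors, only 16 bits are needed, so the word is more than large enough.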
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_try_lock_balance in btrfs_ioctl_balance · 099aa972
      Nikolay Borisov authored
      This eliminates 2 labels and makes the code generally more streamlined.
      Also rename the 'out_bargs' label to 'out_unlock' since bargs is going
      to be freed under the 'out' label. This also fixes a memory leak since
      bargs wasn't correctly freed in one of the conditions which are now moved
      into btrfs_try_lock_balance.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce btrfs_try_lock_balance · 7fb10ed8
      Nikolay Borisov authored
      This function contains the factored out locking sequence of
      btrfs_ioctl_balance. Having this piece of code separate helps to
      simplify btrfs_ioctl_balance which has become too complicated.  This will be
      used in the next patch to streamline the logic in btrfs_ioctl_balance.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio · 1e87770c
      Christoph Hellwig authored
      Use the new btrfs_bio_for_each_sector iterator to simplify
      btrfs_check_read_dio_bio.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add a helper to iterate through a btrfs_bio with sector sized chunks · 261d812b
      Qu Wenruo authored
      Add a helper that works similarly to __bio_for_each_segment, but instead of
      iterating over PAGE_SIZE chunks it iterates over each sector.
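
A minimal userspace sketch of sector-sized iteration is shown below. It is not the kernel macro: the segment struct and the for_each_sector name are hypothetical stand-ins, and a real bio has multiple segments and its own iterator state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one bio segment: a byte range inside a page. */
struct segment {
	uint32_t offset;
	uint32_t len;
};

/*
 * Walk @seg in @sectorsize steps; @off receives the byte offset of each
 * sector relative to the start of the segment.
 */
#define for_each_sector(seg, sectorsize, off) \
	for ((off) = 0; (off) < (seg)->len; (off) += (sectorsize))

/* Count how many sector-sized steps the iteration takes over @seg. */
static unsigned int count_sectors(const struct segment *seg,
				  uint32_t sectorsize)
{
	uint32_t off;
	unsigned int n = 0;

	assert(sectorsize && seg->len % sectorsize == 0);
	for_each_sector(seg, sectorsize, off)
		n++;
	return n;
}
```

A caller that needs per-sector work, such as checksum verification, would place that work in the loop body instead of the counter.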
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [hch: split from a larger patch, and iterate over the offset instead of
            the offset bits]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add parameter comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: factor out a btrfs_csum_ptr helper · a89ce08c
      Christoph Hellwig authored
      Add a helper to find the csum for a byte offset into the csum buffer.
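
One plausible shape for such a lookup, sketched in userspace, is to convert the byte offset to a sector index and scale it by the checksum size. The parameter struct and the function name below are hypothetical and the arithmetic is an assumption about how the helper could work, not a copy of the kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the filesystem parameters the lookup needs. */
struct csum_params {
	uint32_t sectorsize_bits;	/* log2 of the sector size */
	uint32_t csum_size;		/* bytes per checksum */
};

/* Return the checksum slot covering byte @offset of the checksummed data. */
static uint8_t *csum_ptr(const struct csum_params *p, uint8_t *csums,
			 uint64_t offset)
{
	uint64_t index = offset >> p->sectorsize_bits;

	assert(p->csum_size > 0);
	return csums + index * p->csum_size;
}
```

With 4K sectors and 4-byte checksums, any offset inside the third sector maps to the slot that starts 8 bytes into the buffer.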
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor end_bio_extent_readpage code flow · 97861cd1
      Christoph Hellwig authored
      Untangle the goto and move the code it jumps to so it goes in the order
      of the most likely states first.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: factor out a helper to end a single sector buffer I/O · a5aa7ab6
      Christoph Hellwig authored
      Add a helper to end I/O on a single sector, which will come in handy
      with the new read repair code.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove duplicated parameters from submit_data_read_repair() · fd5a6f63
      Qu Wenruo authored
      The function submit_data_read_repair() is only called for buffered data
      read path, thus those members can be calculated using bvec directly:
      
      - start
        start = page_offset(bvec->bv_page) + bvec->bv_offset;
      
      - end
        end = start + bvec->bv_len - 1;
      
      - page
        page = bvec->bv_page;
      
      - pgoff
        pgoff = bvec->bv_offset;
      
      Thus we can safely replace those 4 parameters with just one bio_vec.
      
      Also remove the unused return value.
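
The four derivations listed above can be collected into one helper, sketched here in userspace. The page and bio_vec models are hypothetical stand-ins (the kernel types carry much more state), and 4K pages are assumed for page_offset.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumes 4K pages */

/* Hypothetical stand-ins for struct page and struct bio_vec. */
struct page_model {
	uint64_t index;		/* page index within the file */
};

struct bio_vec_model {
	struct page_model *bv_page;
	uint32_t bv_len;
	uint32_t bv_offset;
};

struct repair_range {
	uint64_t start;
	uint64_t end;
	struct page_model *page;
	uint32_t pgoff;
};

/* File offset of the first byte of @page. */
static uint64_t page_offset_model(const struct page_model *page)
{
	return page->index << PAGE_SHIFT;
}

/* Derive the former four parameters from a single bvec. */
static struct repair_range range_from_bvec(const struct bio_vec_model *bvec)
{
	struct repair_range r;

	assert(bvec->bv_len > 0);
	r.start = page_offset_model(bvec->bv_page) + bvec->bv_offset;
	r.end = r.start + bvec->bv_len - 1;
	r.page = bvec->bv_page;
	r.pgoff = bvec->bv_offset;
	return r;
}
```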
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [hch: also remove the return value]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce a data checksum checking helper · ae643a74
      Qu Wenruo authored
      Although we have several pieces of data csum verification code, we have
      never had a function that just verifies the checksum of one sector.

      Function check_data_csum() does extra work for error reporting, thus it
      requires a lot of extra things like file offset, bio_offset etc.

      Function btrfs_verify_data_csum() is even worse: it uses the page
      checked flag, which means it cannot be used for direct IO pages.

      Here we introduce a new helper, btrfs_check_sector_csum(), which only
      accepts a sector in a page and a pointer to the expected checksum.

      We use this function to implement check_data_csum(), and export it for
      an upcoming patch.
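
To illustrate what a "one sector in, one expected checksum in" helper can look like, here is a userspace sketch. It is an assumption-laden model, not btrfs_check_sector_csum() itself: CRC32C is used as one example algorithm with a plain bitwise implementation, the checksum is compared in little-endian byte order, and the 0 / -1 return convention is this sketch's own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/*
 * Verify one sector starting at @pgoff inside @page against @csum_expected.
 * Returns 0 on match and -1 on mismatch; no error reporting is done here.
 */
static int check_sector_csum(const uint8_t *page, uint32_t pgoff,
			     uint32_t sectorsize, const uint8_t *csum_expected)
{
	uint32_t csum = crc32c(page + pgoff, sectorsize);
	uint8_t raw[4] = {
		(uint8_t)csum, (uint8_t)(csum >> 8),
		(uint8_t)(csum >> 16), (uint8_t)(csum >> 24),
	};

	return memcmp(raw, csum_expected, sizeof(raw)) == 0 ? 0 : -1;
}
```

Because the helper needs neither a file offset nor any page flag, the same call works for buffered and direct IO pages alike; error reporting stays with the caller.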
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [hch: keep passing the csum array as an argument, as the callers want
            to print it, rename per request]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: quit early if the fs has no RAID56 support for raid56 related checks · b036f479
      Qu Wenruo authored
      The following functions do special handling for RAID56 chunks:
      
      - btrfs_is_parity_mirror()
        Check if the range is in RAID56 chunks.
      
      - btrfs_full_stripe_len()
        Either return sectorsize for non-RAID56 profiles or full stripe length
        for RAID56 chunks.
      
      But a filesystem without any RAID56 chunks will not have the RAID56
      incompat flag set, so we can skip the chunk tree lookup completely.
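
The early exit can be sketched in userspace as a flag test placed in front of the expensive lookup. Everything below is a hypothetical model: the flag constant, the fs_model struct and the lookup stub are illustrative names, and the stub only counts calls instead of walking a chunk tree.

```c
#include <assert.h>
#include <stdint.h>

#define INCOMPAT_RAID56	(1ULL << 7)	/* hypothetical flag bit for this model */

struct fs_model {
	uint64_t incompat_flags;
	uint32_t sectorsize;
	unsigned int chunk_lookups;	/* counts expensive lookups */
};

/* Stub for the chunk tree lookup; pretends two 64KiB data stripes. */
static uint32_t lookup_full_stripe_len(struct fs_model *fs, uint64_t logical)
{
	(void)logical;
	fs->chunk_lookups++;
	return 2 * 65536;
}

/* Return sectorsize for non-RAID56 data, else the full stripe length. */
static uint32_t full_stripe_len(struct fs_model *fs, uint64_t logical)
{
	assert(fs->sectorsize);
	/* No RAID56 flag means no RAID56 chunks: skip the lookup entirely. */
	if (!(fs->incompat_flags & INCOMPAT_RAID56))
		return fs->sectorsize;
	return lookup_full_stripe_len(fs, logical);
}
```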
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use PAGE_ALIGNED instead of IS_ALIGNED · 1280d2d1
      Fanjun Kong authored
      <linux/mm.h> already provides the PAGE_ALIGNED macro. Let's
      use it instead of IS_ALIGNED and passing PAGE_SIZE directly.
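
For readers outside the kernel tree, the relationship between the two macros can be shown with simplified userspace definitions. These are approximations written for this sketch (the kernel versions use typeof and live in kernel headers), a 4K page size is assumed, and page_align_down is a hypothetical helper added only to exercise the macros.

```c
#include <assert.h>
#include <stdint.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE	4096UL	/* assumed page size */
#endif

/* Simplified forms; @a must be a power of two. */
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)
#define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)

/* Round @addr down to a page boundary. */
static unsigned long page_align_down(unsigned long addr)
{
	unsigned long aligned = addr & ~(PAGE_SIZE - 1);

	/* Equivalent to IS_ALIGNED(aligned, PAGE_SIZE), just shorter. */
	assert(PAGE_ALIGNED(aligned));
	return aligned;
}
```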
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: fix comment description for sb_write_pointer logic · 31f37269
      Pankaj Raghav authored
      Fix the comment to represent the actual logic used for sb_write_pointer:
      
      - Empty[0] && In use[1] should be an invalid state instead of returning
        zone 0 wp
      - Empty[0] && Full[1] should be returning zone 0 wp instead of zone 1 wp
      - In use[0] && Empty[1] should be returning zone 0 wp instead of being an
        invalid state
      - In use[0] && Full[1] should be returning zone 0 wp instead of returning
        zone 1 wp
      - Full[0] && Empty[1] should be returning zone 1 wp instead of returning
        zone 0 wp
      - Full[0] && In use[1] should be returning zone 1 wp instead of returning
        zone 0 wp
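
The six corrected cases above can be written down as a small decision function. This is a sketch of the listed combinations only: the enum and function names are hypothetical, and combinations the list does not mention (both zones empty, both in use, both full) are deliberately reported as not covered rather than guessed.

```c
#include <assert.h>

enum zone_state { ZONE_EMPTY, ZONE_IN_USE, ZONE_FULL };

enum sb_wp_result {
	SB_WP_ZONE0,		/* return the write pointer of zone 0 */
	SB_WP_ZONE1,		/* return the write pointer of zone 1 */
	SB_WP_INVALID,		/* invalid state */
	SB_WP_NOT_LISTED,	/* combination not covered by the list above */
};

/* Map the states of the two superblock log zones to the listed outcome. */
static enum sb_wp_result sb_wp_choice(enum zone_state z0, enum zone_state z1)
{
	assert(z0 <= ZONE_FULL && z1 <= ZONE_FULL);

	if (z0 == ZONE_EMPTY && z1 == ZONE_IN_USE)
		return SB_WP_INVALID;
	if (z0 == ZONE_EMPTY && z1 == ZONE_FULL)
		return SB_WP_ZONE0;
	if (z0 == ZONE_IN_USE && z1 == ZONE_EMPTY)
		return SB_WP_ZONE0;
	if (z0 == ZONE_IN_USE && z1 == ZONE_FULL)
		return SB_WP_ZONE0;
	if (z0 == ZONE_FULL && z1 == ZONE_EMPTY)
		return SB_WP_ZONE1;
	if (z0 == ZONE_FULL && z1 == ZONE_IN_USE)
		return SB_WP_ZONE1;
	return SB_WP_NOT_LISTED;
}
```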
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix typos in comments · 143823cf
      David Sterba authored
      Codespell has found a few typos.
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 24 Jul, 2022 6 commits
  3. 23 Jul, 2022 2 commits
  4. 22 Jul, 2022 11 commits