1. 17 Apr, 2023 18 commits
    • Qu Wenruo's avatar
      btrfs: replace btrfs_io_context::raid_map with a fixed u64 value · 18d758a2
      Qu Wenruo authored
      In btrfs_io_context structure, we have a pointer raid_map, which
      indicates the logical bytenr for each stripe.
      
      But considering we always call sort_parity_stripes(), the result
      raid_map[] is always sorted, thus raid_map[0] is always the logical
      bytenr of the full stripe.
      
      So why we waste the space and time (for sorting) for raid_map?
      
      This patch will replace btrfs_io_context::raid_map with a single u64
      number, full_stripe_start, by:
      
      - Replace btrfs_io_context::raid_map with full_stripe_start
      
      - Replace call sites using raid_map[0] to use full_stripe_start
      
      - Replace call sites using raid_map[i] to compare with nr_data_stripes.
      
      The benefits are:
      
      - Less memory wasted on raid_map
        It's sizeof(u64) * num_stripes vs sizeof(u64).
        It'll always save at least one u64, and the benefit grows larger with
        num_stripes.
      
      - No more weird alloc_btrfs_io_context() behavior
        As there is only one fixed size + one variable length array.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      18d758a2
    • Qu Wenruo's avatar
      btrfs: use an efficient way to represent source of duplicated stripes · 1faf3885
      Qu Wenruo authored
      For btrfs dev-replace, we have to duplicate writes to the source
      device into the target device.
      
      For non-RAID56, all writes into the same mapped ranges are sharing the
      same content, thus they don't really need to bother anything.
      (E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
      same write to all involved devices).
      
      But for RAID56, all stripes contain different content, thus we must
      have a clear mapping of which stripe is duplicated from which original
      stripe.
      
      Currently we use a complex way using tgtdev_map[] array, e.g:
      
       num_tgtdevs = 1
       tgtdev_map[0] = 0    <- Means stripes[0] is not involved in replace.
       tgtdev_map[1] = 3    <- Means stripes[1] is involved in replace,
      			 and it's duplicated to stripes[3].
       tgtdev_map[2] = 0    <- Means stripes[2] is not involved in replace.
      
      But this is wasting some space, and ignores one important thing for
      dev-replace, there is at most one running replace.
      
      Thus we can change it to a fixed array to represent the mapping:
      
       replace_nr_stripes = 1
       replace_stripe_src = 1    <- Means stripes[1] is involved in replace.
      			      thus the extra stripe is a copy of
      			      stripes[1]
      
      By this we can save some space for bioc on RAID56 chunks with many
      devices.  And we get rid of one variable sized array from bioc.
      
      Thus the patch involves the following changes:
      
      - Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
        and @replace_stripe_src.
      
        @num_tgtdevs is just renamed to @replace_nr_stripes.
        While the mapping is completely changed.
      
      - Add extra ASSERT()s for RAID56 code
      
      - Only add two more extra stripes for dev-replace cases.
        As we have an upper limit on how many dev-replace stripes we can have.
      
      - Unify the behavior of handle_ops_on_dev_replace()
        Previously handle_ops_on_dev_replace() go two different paths for
        WRITE and GET_READ_MIRRORS.
        Now unify them by always going the WRITE path first (with at most 2
        replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
        extra stripes, just drop one stripe.
      
      - Remove the @real_stripes argument from alloc_btrfs_io_context()
        As we don't need the old variable length array any more.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1faf3885
    • Qu Wenruo's avatar
      btrfs: reduce type width of btrfs_io_contexts · 4ced85f8
      Qu Wenruo authored
      That structure is our ultimate object for all __btrfs_map_block()
      related functions.  We have some hard to understand members, like
      tgtdev_map, but without any comments.
      
      This patch will improve the situation:
      
      - Add extra comments for num_stripes, mirror_num, num_tgtdevs and
        tgtdev_map[]
        Especially for the last two members, add a dedicated (thus very long)
        comments for them, with example to explain it.
      
      - Shrink those int members to u16.
        In fact our on-disk format is only using u16 for num_stripes, thus
        no need to use int at all.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4ced85f8
    • Qu Wenruo's avatar
      btrfs: simplify the bioc argument for handle_ops_on_dev_replace() · be5c7edb
      Qu Wenruo authored
      There is no memory re-allocation for handle_ops_on_dev_replace(), thus
      we don't need to pass a btrfs_io_context pointer.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      be5c7edb
    • Qu Wenruo's avatar
      btrfs: reduce div64 calls by limiting the number of stripes of a chunk to u32 · 6ded22c1
      Qu Wenruo authored
      There are quite some div64 calls inside btrfs_map_block() and its
      variants.
      
      Such calls are for @stripe_nr, where @stripe_nr is the number of
      stripes before our logical bytenr inside a chunk.
      
      However we can eliminate such div64 calls by just reducing the width of
      @stripe_nr from 64 to 32.
      
      This can be done because our chunk size limit is already 10G, with fixed
      stripe length 64K.
      Thus a U32 is definitely enough to contain the number of stripes.
      
      With such width reduction, we can get rid of slower div64, and extra
      warning for certain 32bit arch.
      
      This patch would do:
      
      - Add a new tree-checker chunk validation on chunk length
        Make sure no chunk can reach 256G, which can also act as a bitflip
        checker.
      
      - Reduce the width from u64 to u32 for @stripe_nr variables
      
      - Replace unnecessary div64 calls with regular modulo and division
        32bit division and modulo are much faster than 64bit operations, and
        we are finally free of the div64 fear at least in those involved
        functions.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6ded22c1
    • Qu Wenruo's avatar
      btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN · a97699d1
      Qu Wenruo authored
      Currently btrfs doesn't support stripe lengths other than 64KiB.
      This is already set in the tree-checker.
      
      There is really no meaning to record that fixed value in map_lookup for
      now, and can all be replaced with BTRFS_STRIPE_LEN.
      
      Furthermore we can use the fix stripe length to do the following
      optimization:
      
      - Use BTRFS_STRIPE_LEN_SHIFT to replace some 64bit division
        Now we only need to do a right shift.
      
        And the value of BTRFS_STRIPE_LEN itself is already too large for bit
        shift, thus if we accidentally use BTRFS_STRIPE_LEN to do bit shift,
        a compiler warning would be triggered.
      
        Thus this bit shift optimization would be safe.
      
      - Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a97699d1
    • Christoph Hellwig's avatar
      btrfs: move all btree inode initialization into btrfs_init_btree_inode · dcb2137c
      Christoph Hellwig authored
      Move the remaining code that deals with initializing the btree
      inode into btrfs_init_btree_inode instead of splitting it between
      that helpers and its only caller.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dcb2137c
    • Anand Jain's avatar
      btrfs: switch search_file_offset_in_bio to return bool · 19337f8e
      Anand Jain authored
      Function search_file_offset_in_bio() finds the file offset in the
      file_offset_ret, and we use the return value to indicate if it is
      successful, so use bool.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      19337f8e
    • Anand Jain's avatar
      btrfs: avoid reusing return variable in nested block in btrfs_lookup_bio_sums · da8269a3
      Anand Jain authored
      The function btrfs_lookup_bio_sums() and a nested if statement declare
      ret respectively as blk_status_t and int.
      
      There is no need to store the return value of
      search_file_offset_in_bio() to ret as this is a one-time call.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da8269a3
    • Johannes Thumshirn's avatar
      btrfs: open code btrfs_csum_ptr · fa13661c
      Johannes Thumshirn authored
      Remove btrfs_csum_ptr() and fold it into it's only caller.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa13661c
    • Christoph Hellwig's avatar
      btrfs: raid56: no need for irqsafe locking · 74cc3600
      Christoph Hellwig authored
      These days all the operations that take locks in the raid56.c code are
      run from user context (mostly workqueues).  Drop all the irqsafe locking
      that is not required any more.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      74cc3600
    • Josef Bacik's avatar
      btrfs: abort the transaction if we get an error during snapshot drop · 9a93b5a3
      Josef Bacik authored
      We were seeing weird errors when we were testing our btrfs backports
      before we had the incorrect level check fix.  These errors appeared to
      be improper error handling, but error injection testing uncovered that
      the errors were a result of corruption that occurred from improper error
      handling during snapshot delete.
      
      With snapshot delete if we encounter any errors during walk_down or
      walk_up we'll simply return an error, we won't abort the transaction.
      This is problematic because we will be dropping references for nodes and
      leaves along the way, and if we fail in the middle we will leave the
      file system corrupt because we don't know where we left off in the drop.
      
      Fix this by making sure we abort if we hit any errors during the walk
      down or walk up operations, as we have no idea what operations could
      have been left half done at this point.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9a93b5a3
    • Josef Bacik's avatar
      btrfs: handle errors in walk_down_tree properly · 4e194384
      Josef Bacik authored
      We can get errors in walk_down_proc as we try and lookup extent info for
      the snapshot dropping to act on.  However if we get an error we simply
      return 1 which indicates we're done with walking down, which will lead
      us to improperly continue with the snapshot drop with the incorrect
      information.  Instead break if we get any error from walk_down_proc or
      do_walk_down, and handle the case of ret == 1 by returning 0, otherwise
      return the ret value that we have.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4e194384
    • Josef Bacik's avatar
      btrfs: drop root refs properly when orphan cleanup fails · 6989627d
      Josef Bacik authored
      When we mount the file system we do something like this:
      
      	while (1) {
      		lookup fs roots;
      
      		for (i = 0; i < num_roots; i++) {
      			ret = btrfs_orphan_cleanup(roots[i]);
      			if (ret)
      				break;
      			btrfs_put_root(roots[i]);
      		}
      	}
      
      	for (; i < num_roots; i++)
      		btrfs_put_root(roots[i]);
      
      As you can see if we break in that inner loop we just go back to the
      outer loop and lose the fact that we have to drop references on the
      remaining roots we looked up.  Fix this by making an out label and
      jumping to that on error so we don't leak a reference to the roots we
      looked up.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6989627d
    • Josef Bacik's avatar
      btrfs: add missing iputs on orphan cleanup failure · a13bb2c0
      Josef Bacik authored
      We missed a couple of iput()s in the orphan cleanup failure paths, add
      them so we don't get refcount errors. The iput needs to be done in the
      check and not under a common label due to the way the code is
      structured.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a13bb2c0
    • Josef Bacik's avatar
      btrfs: handle errors from btrfs_read_node_slot in split · 9cf14029
      Josef Bacik authored
      While investigating a problem with error injection I tripped over
      curious behavior in the node/leaf splitting code.  If we get an EIO when
      trying to read either the left or right leaf/node for splitting we'll
      simply treat the node as if it were full and continue on.  The end
      result of this isn't too bad, we simply end up allocating a block when
      we may have pushed items into the adjacent blocks.
      
      However this does essentially allow us to continue to modify a file
      system that we've gotten errors on, either from a bad disk or csum
      mismatch or other corruption.  This isn't particularly safe, so instead
      handle these btrfs_read_node_slot() usages differently.  We allow you to
      pass in any slot, the idea being that we save some code if the slot
      number is outside of the range of the parent.  This means we treat all
      errors the same, when in reality we only want to ignore -ENOENT.
      
      Fix this by changing how we call btrfs_read_node_slot(), which is to
      only call it for slots we know are valid.  This way if we get an error
      back from reading the block we can properly pass the error up the chain.
      This was validated with the error injection testing I was doing.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9cf14029
    • Josef Bacik's avatar
      btrfs: replace BUG_ON with ASSERT in btrfs_read_node_slot · d4694728
      Josef Bacik authored
      In btrfs_read_node_slot() we have a BUG_ON() that can be converted to an
      ASSERT(), it's from an extent buffer and the level is validated at the
      time it's read from disk.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4694728
    • Josef Bacik's avatar
      btrfs: use btrfs_handle_fs_error in btrfs_fill_super · 13b98989
      Josef Bacik authored
      While trying to track down a lost EIO problem I hit the following
      assertion while doing my error injection testing
      
        BTRFS warning (device nvme1n1): transaction 1609 (with 180224 dirty metadata bytes) is not committed
        assertion failed: !found, in fs/btrfs/disk-io.c:4456
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/messages.h:169!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 0 PID: 1445 Comm: mount Tainted: G        W          6.2.0-rc5+ #3
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
        RIP: 0010:btrfs_assertfail.constprop.0+0x18/0x1a
        RSP: 0018:ffffb95fc3b0bc68 EFLAGS: 00010286
        RAX: 0000000000000034 RBX: ffff9941c2ac2000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffffffffb6741f7d RDI: 00000000ffffffff
        RBP: ffff9941c2ac2428 R08: 0000000000000000 R09: ffffb95fc3b0bb38
        R10: 0000000000000003 R11: ffffffffb71438a8 R12: ffff9941c2ac2428
        R13: ffff9941c2ac2450 R14: ffff9941c2ac2450 R15: 000000000002c000
        FS:  00007fcea2d07800(0000) GS:ffff9941fbc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f00cc7c83a8 CR3: 000000010c686000 CR4: 0000000000350ef0
        Call Trace:
         <TASK>
         close_ctree+0x426/0x48f
         btrfs_mount_root.cold+0x7e/0xee
         ? legacy_parse_param+0x2b/0x220
         legacy_get_tree+0x2b/0x50
         vfs_get_tree+0x29/0xc0
         vfs_kern_mount.part.0+0x73/0xb0
         btrfs_mount+0x11d/0x3d0
         ? legacy_parse_param+0x2b/0x220
         legacy_get_tree+0x2b/0x50
         vfs_get_tree+0x29/0xc0
         path_mount+0x438/0xa40
         __x64_sys_mount+0xe9/0x130
         do_syscall_64+0x3e/0x90
         entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      This is because the error injection did an EIO for the root inode lookup
      and we simply jumped to closing the ctree.  However because we didn't
      mark the file system as having an error we skipped all of the broken
      transaction cleanup stuff, and thus triggered this ASSERT().  Fix this
      by calling btrfs_handle_fs_error() in this case so we have the error set
      on the file system.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      13b98989
  2. 16 Apr, 2023 12 commits
  3. 15 Apr, 2023 6 commits
  4. 14 Apr, 2023 4 commits