• Filipe Manana's avatar
    btrfs: fix race between finishing block group creation and its item update · 2d6cd791
    Filipe Manana authored
    Commit 675dfe12 ("btrfs: fix block group item corruption after
    inserting new block group") fixed one race that resulted in not persisting
    a block group's item when its "used" bytes field decreases to zero.
    However there's another race that can happen in a much shorter time window
    that results in the same problem. The following sequence of steps explains
    how it can happen:
    
    1) Task A creates a metadata block group X, its "used" and "commit_used"
       fields are initialized to 0;
    
    2) Two extents are allocated from block group X, so its "used" field is
       updated to 32K, and its "commit_used" field remains as 0;
    
    3) Transaction commit starts, by some task B, and it enters
       btrfs_start_dirty_block_groups(). There it tries to update the block
       group item for block group X, which currently has its "used" field with
       a value of 32K and its "commit_used" field with a value of 0. However
       that fails since the block group item was not yet inserted, so at
       update_block_group_item(), the btrfs_search_slot() call returns 1, and
       then we set 'ret' to -ENOENT. Before jumping to the label 'fail'...
    
    4) The block group item is inserted by task A, when for example
       btrfs_create_pending_block_groups() is called when releasing its
       transaction handle. This results in insert_block_group_item() inserting
       the block group item in the extent tree (or block group tree), with a
       "used" field having a value of 32K and setting "commit_used", in struct
       btrfs_block_group, to the same value (32K);
    
    5) Task B jumps to the 'fail' label and then resets the "commit_used"
       field to 0. At btrfs_start_dirty_block_groups(), because -ENOENT was
       returned from update_block_group_item(), we add the block group again
       to the list of dirty block groups, so that we will try again in the
       critical section of the transaction commit when calling
       btrfs_write_dirty_block_groups();
    
    6) Later the two extents from block group X are freed, so its "used" field
       becomes 0;
    
    7) If no more extents are allocated from block group X before we get into
       btrfs_write_dirty_block_groups(), then when we call
       update_block_group_item() again for block group X, we will not update
       the block group item to reflect that it has 0 bytes used, because the
       "used" and "commit_used" fields in struct btrfs_block_group have the
       same value, a value of 0.
    
       As a result after committing the transaction we have an empty block
       group with its block group item having a 32K value for its "used" field.
       This will trigger errors from fsck ("btrfs check" command) and after
       mounting again the fs, the cleaner kthread will not automatically delete
       the empty block group, since its "used" field is not 0. Possibly there
       are other issues due to this inconsistency.
    
       When this issue happens, the error reported by fsck is like this:
    
         [1/7] checking root items
         [2/7] checking extents
         block group [1104150528 1073741824] used 39796736 but extent items used 0
         ERROR: errors found in extent allocation tree or chunk allocation
         (...)
    
    So fix this by not resetting the "commit_used" field of a block group when
    we don't find the block group item at update_block_group_item().
    
    Fixes: 7248e0ce ("btrfs: skip update of block group item if used bytes are the same")
    CC: stable@vger.kernel.org # 6.2+
    Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    2d6cd791
block-group.c 135 KB