    btrfs: rework chunk allocation to avoid exhaustion of the system chunk array · 79bd3712
    Filipe Manana authored
    Commit eafa4fd0 ("btrfs: fix exhaustion of the system chunk array
    due to concurrent allocations") fixed a problem that resulted in
    exhausting the system chunk array in the superblock when many tasks
    allocate chunks in parallel. Basically, too many tasks enter the
    first phase of chunk allocation without previous tasks having finished
    their second phase of allocation, resulting in too many system chunks
    being allocated. That was originally observed when running the fallocate
    tests of stress-ng on a PowerPC machine, using a node size of 64K.
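
    Schematically, the worst case looks like this (a simplified
    illustration of the behaviour described above, not actual code):

      Task 1            Task 2            ...   Task N
      ------            ------                  ------
      phase 1:          phase 1:                phase 1:
      sees system       sees system             sees system
      space reserved    space reserved          space reserved
      by others, so     by others, so           by others, so
      allocates a new   allocates a new         allocates a new
      system chunk      system chunk            system chunk

    Since no task has run its second phase yet, none of the reserved
    system space has been released, and each new system chunk adds an
    entry to the fixed size array in the superblock (limited to
    BTRFS_SYSTEM_CHUNK_ARRAY_SIZE bytes), eventually overflowing it.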
    
    However that commit also introduced a deadlock where a task in phase 1 of
    the chunk allocation waited for another task that had allocated a system
    chunk to finish its phase 2, but that other task was waiting on an extent
    buffer lock held by the first task, therefore resulting in both tasks not
    making any progress. That change was later reverted by a patch with the
    subject "btrfs: fix deadlock with concurrent chunk allocations involving
    system chunks", since there is no simple and short solution to address it
    and the deadlock is relatively easy to trigger on zoned filesystems, while
    the system chunk array exhaustion is not so common.
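
    Schematically, the circular wait of that deadlock (simplified):

      Task A                               Task B
      ------                               ------
      in phase 1 of chunk allocation,
      holds an extent buffer of the
      chunk btree locked
      waits for Task B to finish its
      phase 2 before proceeding
                                           allocated a system chunk in
                                           phase 1, its phase 2 needs the
                                           extent buffer locked by Task A
                                           -> blocks

    Neither task can make progress.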
    
    This change reworks the chunk allocation to avoid the system chunk array
    exhaustion. It accomplishes that by making the first phase of chunk
    allocation do the updates of the device items in the chunk btree and the
    insertion of the new chunk item in the chunk btree. This is done while
    under the protection of the chunk mutex (fs_info->chunk_mutex), in the
    same critical section that checks for available system space, allocates
    a new system chunk if needed and reserves system chunk space. This way
    the reserved system space is released when the first phase ends, instead
    of being held until the second phase, at transaction commit, completes.
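
    In rough outline, the first phase now looks like the sketch below
    (simplified from the patch; error handling and many details are
    omitted, so treat it as illustrative rather than literal):

      static struct btrfs_block_group *do_chunk_alloc(struct btrfs_trans_handle *trans,
                                                      u64 flags)
      {
              struct btrfs_block_group *bg;
              int ret;

              /*
               * Called with fs_info->chunk_mutex held. Make sure the system
               * space info has room for the chunk btree updates done below,
               * allocating a new system chunk and reserving system chunk
               * space if needed.
               */
              check_system_chunk(trans, flags);

              /* Allocate the device extents and create the block group. */
              bg = btrfs_alloc_chunk(trans, flags);
              if (IS_ERR(bg))
                      return bg;

              /*
               * New with this change: update the device items and insert the
               * new chunk item in the chunk btree right away, instead of
               * delaying that to the second phase at transaction commit.
               */
              ret = btrfs_chunk_alloc_add_chunk_item(trans, bg);
              if (ret)
                      return ERR_PTR(ret);

              return bg;
      }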
    
    The same logic is applied to chunk removal as well, since it previously
    kept reserved system space long after it was done updating the chunk
    btree.
    
    For direct allocation of system chunks, the previous behaviour remains,
    because otherwise we would deadlock on extent buffers of the chunk btree.
    Changes to the chunk btree are, by and large, done by chunk allocation
    and chunk removal, which first reserve chunk system space and then later
    do changes to the chunk btree. The other remaining cases are uncommon and
    correspond to adding a device, removing a device and resizing a device.
    All these other cases do not pre-reserve system space; they modify the
    chunk btree right away, so they don't hold reserved space for a long
    period like chunk allocation and chunk removal do.
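
    To illustrate why, here is a sketch of the direct system chunk
    allocation path (simplified, with hypothetical variable names for
    the space accounting; only btrfs_alloc_chunk() and
    btrfs_system_alloc_profile() are real functions):

      /*
       * May run while the caller already holds locks on extent buffers
       * of the chunk btree, since it is about to modify that tree.
       */
      if (available_system_space < needed_space) {
              /*
               * Do only the first phase: allocate the system chunk, but
               * do not insert its chunk item in the chunk btree here, as
               * that would require locking/COWing extent buffers this
               * task may already hold locked. The chunk item insertion
               * remains delayed to the second phase, at transaction
               * commit.
               */
              btrfs_alloc_chunk(trans, btrfs_system_alloc_profile(fs_info));
      }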
    
    The diff of this change is huge, but more than half of it is just the
    addition of comments describing how things work regarding chunk
    allocation and removal, including both the new behavior and the parts
    of the old behavior that did not change.
    
    CC: stable@vger.kernel.org # 5.12+
    Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
    Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Tested-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>