• Filipe Manana's avatar
    Btrfs: fix hang during inode eviction due to concurrent readahead · 6ca07097
    Filipe Manana authored
    Zygo Blaxell and other users have reported occasional hangs while an
    inode is being evicted, leading to traces like the following:
    
    [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
    [ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
    [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
    [ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
    [ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
    [ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
    [ 5281.981396] Call Trace:
    [ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
    [ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
    [ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
    [ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
    [ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
    [ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
    [ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
    [ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
    [ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
    [ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
    [ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
    [ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
    [ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
    [ 5282.001299] 1 lock held by rm/20488:
    [ 5282.002066]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
    
    This happens when we have readahead, which calls readpages(), happening
    right before the inode eviction handler is invoked. So the reason is
    essentially:
    
    1) readpages() is called while a reference on the inode is held, so
       eviction can not be triggered before readpages() returns. It also
       locks one or more ranges in the inode's io_tree (which is done at
       extent_io.c:__do_contiguous_readpages());
    
    2) readpages() submits several read bios, all with an end io callback
       that runs extent_io.c:end_bio_extent_readpage() and that is executed
       by other task when a bio finishes, corresponding to a work queue
       (fs_info->end_io_workers) worker kthread. This callback unlocks
       the ranges in the inode's io_tree that were previously locked in
       step 1;
    
    3) readpages() returns, the reference on the inode is dropped;
    
    4) One or more of the read bios previously submitted are still not
       complete (their end io callback was not yet invoked or has not
       yet finished execution);
    
    5) Inode eviction is triggered (through an unlink call for example).
       The inode reference count was not incremented before submitting
       the read bios, therefore this is possible;
    
    6) The eviction handler starts executing and enters the loop that
       iterates over all extent states in the inode's io_tree;
    
    7) The loop picks one extent state record and uses its ->start and
       ->end fields, after releasing the inode's io_tree spinlock, to
       call lock_extent_bits() and clear_extent_bit(). The call to lock
       the range [state->start, state->end] blocks because the whole
       range or a part of it was locked by the previous call to
       readpages() and the corresponding end io callback, which unlocks
       the range was not yet executed;
    
    8) The end io callback for the read bio is executed and unlocks the
       range [state->start, state->end] (or a superset of that range).
       And at clear_extent_bit() the extent_state record state is used
       as a second argument to split_state(), which sets state->start to
       a larger value;
    
    9) The task executing the eviction handler is woken up by the task
       executing the bio's end io callback (through clear_state_bit) and
       the eviction handler locks the range
       [old value for state->start, state->end]. Shortly after, when
       calling clear_extent_bit(), it unlocks the range
       [new value for state->start, state->end], so it ends up unlocking
       only part of the range that it locked, leaving an extent state
       record in the io_tree that represents the unlocked subrange;
    
    10) The eviction handler loop, in its next iteration, gets the
        extent_state record for the subrange that it did not unlock in the
        previous step and then tries to lock it, resulting in an hang.
    
    So fix this by not using the ->start and ->end fields of an existing
    extent_state record. This is a simple solution, and an alternative
    could be to bump the inode's reference count before submitting each
    read bio and having it dropped in the bio's end io callback. But that
    would be a more invasive/complex change and would not protect against
    other possible places that are not holding a reference on the inode
    as well. Something to consider in the future.
    
    Many thanks to Zygo Blaxell for reporting, in the mailing list, the
    issue, a set of scripts to trigger it and testing this fix.
    Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
    Tested-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarChris Mason <clm@fb.com>
    6ca07097
inode.c 260 KB