    md/raid1: exit sync request if MD_RECOVERY_INTR is set · 8c242593
    Yufen Yu authored
    We hit a stuck sync thread with the following stack:
    
     raid1_sync_request+0x2c9/0xb50
     md_do_sync+0x983/0xfa0
     md_thread+0x11c/0x160
     kthread+0x111/0x130
     ret_from_fork+0x35/0x40
     0xffffffffffffffff
    
    At the same time, there is a stuck mdadm thread (mdadm --manage
    /dev/md2 --add /dev/sda). It is trying to stop the sync thread:
    
     kthread_stop+0x42/0xf0
     md_unregister_thread+0x3a/0x70
     md_reap_sync_thread+0x15/0x160
     action_store+0x142/0x2a0
     md_attr_store+0x6c/0xb0
     kernfs_fop_write+0x102/0x180
     __vfs_write+0x33/0x170
     vfs_write+0xad/0x1a0
     SyS_write+0x52/0xc0
     do_syscall_64+0x6e/0x190
     entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    
    Debug tools show that the sync thread is waiting in raise_barrier()
    for raid1d() to end all normal I/O bios queued on bio_end_io_list
    (introduced in commit 55ce74d4). But raid1d() cannot end these bios
    while the MD_CHANGE_PENDING bit is set. It needs to take the
    mddev->reconfig_mutex lock and then clear the bit in
    md_check_recovery(). However, that lock is held by mdadm in
    action_store().
    
    Thus, there is a circular wait: mdadm waits for the sync thread to
    stop, the sync thread waits for raid1d() to end the bios, and raid1d()
    waits for mdadm to release the mddev->reconfig_mutex lock before it
    can end the bios.
    
    Fix this by checking MD_RECOVERY_INTR while waiting in raise_barrier(),
    so that the sync thread can exit while mdadm is stopping it.
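
    The idea can be sketched in userspace as follows. This is a
    simplified, single-threaded model and not the kernel patch itself:
    struct model_conf, model_raise_barrier(), model_stop_sync() and
    model_complete_io() are hypothetical names, a bounded polling loop
    stands in for the kernel wait queue, and -ETIMEDOUT stands in for
    the hang described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct model_conf {
    int  nr_pending;     /* normal I/O bios still in flight */
    int  barrier;        /* number of barriers currently raised */
    bool recovery_intr;  /* stands in for the MD_RECOVERY_INTR bit */
};

/* Called once per loop iteration; models another thread making progress. */
typedef void (*wakeup_fn)(struct model_conf *conf, int step);

/*
 * Raise a barrier and wait for pending I/O to drain.
 * Returns 0 once the I/O has drained (barrier stays raised),
 * -EINTR if an interrupt was requested (barrier is dropped again),
 * -ETIMEDOUT if neither happened within max_steps (models the hang).
 */
static int model_raise_barrier(struct model_conf *conf, wakeup_fn on_wakeup,
                               int max_steps)
{
    assert(conf != NULL && max_steps > 0);

    conf->barrier++;
    for (int step = 0; step < max_steps; step++) {
        if (conf->recovery_intr) {
            /* The fix: give up instead of waiting for I/O forever. */
            conf->barrier--;
            return -EINTR;
        }
        if (conf->nr_pending == 0)
            return 0;
        if (on_wakeup)
            on_wakeup(conf, step);
    }
    conf->barrier--;
    return -ETIMEDOUT;
}

/* Models mdadm asking the sync thread to stop on the third iteration. */
static void model_stop_sync(struct model_conf *conf, int step)
{
    if (step == 2)
        conf->recovery_intr = true;
}

/* Models raid1d() ending one pending bio per iteration. */
static void model_complete_io(struct model_conf *conf, int step)
{
    (void)step;
    if (conf->nr_pending > 0)
        conf->nr_pending--;
}
```

    In this model, a caller that sees -EINTR would skip the sync request
    and return, which lets the thread that is stopping it make progress.
    With no callback and pending I/O that never drains, the loop runs out
    of steps, which corresponds to the stuck sync thread.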
    
    Fixes: 55ce74d4 ("md/raid1: ensure device failure recorded before write request returns.")
    Signed-off-by: Jason Yan <yanaijie@huawei.com>
    Signed-off-by: Yufen Yu <yuyufen@huawei.com>
    Signed-off-by: Shaohua Li <shli@fb.com>