    scsi: core: sysfs: Fix hang when device state is set via sysfs · 4edd8cd4
    Mike Christie authored
    This fixes a regression added with:
    
    commit f0f82e24 ("scsi: core: Fix capacity set to zero after
    offlinining device")
    
    The problem is that after iSCSI recovery, iscsid will call into the kernel
    to set the dev's state to running, and with that patch we now call
    scsi_rescan_device() with the state_mutex held. If the SCSI error handler
    thread is just starting to test the device in scsi_send_eh_cmnd() then it's
    going to try to grab the state_mutex.
    
    We are then stuck: when scsi_rescan_device() tries to send its I/O,
    scsi_queue_rq() calls scsi_host_queue_ready() -> scsi_host_in_recovery(),
    which returns true (the host state is still in recovery), so the I/O is
    just requeued. scsi_send_eh_cmnd() will then never be able to grab the
    state_mutex to finish error handling.
    
    To prevent the deadlock, move the rescan-related code to after we drop the
    state_mutex.
    
    This also adds a check for whether the device is already in the running
    state. This prevents extra scans and helps the iscsid case: if the
    transport class has already onlined the device during its recovery
    process, userspace does not need to do it again and possibly block that
    daemon.
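
    The two changes described above can be illustrated with a small userspace
    model. This is only a sketch under stated assumptions, not the kernel
    code: model_dev, model_store_state(), model_rescan() and the MODEL_*
    states are hypothetical names, and a pthread mutex stands in for the
    state_mutex. It shows the state change happening under the mutex, the
    rescan running only after the mutex is dropped, and a request for the
    state the device is already in being a no-op.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the device states discussed above. */
enum model_state { MODEL_OFFLINE, MODEL_RUNNING };

struct model_dev {
    pthread_mutex_t state_mutex;
    bool mutex_held;          /* bookkeeping so the sketch can observe locking */
    enum model_state state;
    int rescans;              /* number of rescans issued */
    int rescans_under_mutex;  /* rescans issued while state_mutex was held */
};

static void model_dev_init(struct model_dev *d, enum model_state initial)
{
    assert(d != NULL);
    pthread_mutex_init(&d->state_mutex, NULL);
    d->mutex_held = false;
    d->state = initial;
    d->rescans = 0;
    d->rescans_under_mutex = 0;
}

static void model_lock(struct model_dev *d)
{
    pthread_mutex_lock(&d->state_mutex);
    d->mutex_held = true;
}

static void model_unlock(struct model_dev *d)
{
    d->mutex_held = false;
    pthread_mutex_unlock(&d->state_mutex);
}

/* Stand-in for the rescan: it may block on I/O, so it must not run locked. */
static void model_rescan(struct model_dev *d)
{
    d->rescans++;
    if (d->mutex_held)
        d->rescans_under_mutex++;
}

/* Returns true when a rescan was issued for this request. */
static bool model_store_state(struct model_dev *d, enum model_state requested)
{
    bool rescan = false;

    assert(d != NULL);
    model_lock(d);
    if (d->state != requested) {
        d->state = requested;
        /* Only a transition into the running state needs a rescan. */
        rescan = (requested == MODEL_RUNNING);
    }
    model_unlock(d);

    /* The potentially blocking work happens only after the mutex is dropped. */
    if (rescan)
        model_rescan(d);
    return rescan;
}
```

    In this model, calling model_store_state(&d, MODEL_RUNNING) on an offline
    device issues one rescan outside the mutex, and repeating the same call
    issues none, so a second writer is never blocked behind a rescan.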
    
    Link: https://lore.kernel.org/r/20211105221048.6541-3-michael.christie@oracle.com
    Fixes: f0f82e24 ("scsi: core: Fix capacity set to zero after offlinining device")
    Cc: Bart Van Assche <bvanassche@acm.org>
    Cc: lijinlin <lijinlin3@huawei.com>
    Cc: Wu Bo <wubo40@huawei.com>
    Reviewed-by: Lee Duncan <lduncan@suse.com>
    Reviewed-by: Wu Bo <wubo40@huawei.com>
    Signed-off-by: Mike Christie <michael.christie@oracle.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
scsi_sysfs.c 42.2 KB