    scsi: smartpqi: correct lun reset issues · ca8ad9bc
    Kevin Barnett authored
    [ Upstream commit 2ba55c98 ]
    
    Problem:
    The Linux kernel takes a logical volume offline after a LUN reset.  This is
    generally accompanied by this message in the dmesg output:
    
    Device offlined - not ready after error recovery
    
    Root Cause:
    The root cause is a "quirk" in the timeout handling in the Linux SCSI
    layer. The Linux kernel places a 30-second timeout on most media access
    commands (reads and writes) that it sends to device drivers.  When a media
    access command times out, the Linux kernel goes into error recovery mode
    for the LUN that was the target of the command that timed out. Every
    command that timed out is kept on a list inside of the Linux kernel to be
    retried later. The kernel attempts to recover the command(s) that timed out
    by issuing a LUN reset followed by a TEST UNIT READY. If the LUN reset and
    TEST UNIT READY commands are successful, the kernel retries the command(s)
    that timed out.
    
    Each SCSI command issued by the kernel has a result field associated with
    it. This field indicates the final result of the command (success or
    error). When a command times out, the kernel places a value in this result
    field indicating that the command timed out.
    
    The "quirk" is that after the LUN reset and TEST UNIT READY commands are
    completed, the kernel checks each command on the timed-out command list
    before retrying it. If the result field is still "timed out", the kernel
    treats that command as not having been successfully recovered for a
    retry. If more than two commands are in this state, the kernel takes the
    LUN offline.
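
The check described above can be modeled with a short, self-contained C sketch. It is not the kernel's error-handling code: the names `cmd_result`, `timed_out_cmd`, `count_unrecovered`, `lun_goes_offline`, and `MAX_UNRECOVERED` are hypothetical, and the threshold of two is taken only from the description in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes; the real kernel encodes results differently. */
enum cmd_result { RESULT_SUCCESS = 0, RESULT_TIMED_OUT = 1 };

/* Hypothetical stand-in for one entry on the timed-out command list. */
struct timed_out_cmd {
    enum cmd_result result;
};

/* Threshold as described in the text: more than two unrecovered commands. */
#define MAX_UNRECOVERED 2

/* Count the commands whose result field still reads "timed out". */
static size_t count_unrecovered(const struct timed_out_cmd *cmds, size_t n)
{
    size_t unrecovered = 0;

    assert(cmds != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (cmds[i].result == RESULT_TIMED_OUT)
            unrecovered++;
    }
    return unrecovered;
}

/* Model of the post-recovery decision: too many stale results, LUN goes offline. */
static bool lun_goes_offline(const struct timed_out_cmd *cmds, size_t n)
{
    return count_unrecovered(cmds, n) > MAX_UNRECOVERED;
}
```

In this model, three commands that still carry `RESULT_TIMED_OUT` after the reset take the LUN offline, while clearing any one of them back to `RESULT_SUCCESS` keeps it online.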
    
    Fix:
    When our RAIDStack receives a LUN reset, it simply waits until all
    outstanding commands complete. Generally, all of these outstanding commands
    complete successfully. Therefore, the fix in the smartpqi driver is to
    always set the command result field to indicate success when a request
    completes successfully. This normally isn’t necessary because the result
    field is always initialized to success when the command is submitted to the
    driver. So when the command completes successfully, the result field is
    left untouched. But in this case, the kernel changes the result field
    behind the driver’s back and then expects the field to be changed by the
    driver as the commands that timed out complete.
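
The difference between the old and new completion behavior can be illustrated with a simplified model. This is a sketch, not the smartpqi code: `io_result`, `io_request`, `complete_io_old`, and `complete_io_fixed` are hypothetical names, and the `hw_ok` flag stands in for whatever status the controller reports.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for illustration only. */
enum io_result { IO_SUCCESS = 0, IO_TIMED_OUT = 1, IO_ERROR = 2 };

/* Hypothetical stand-in for a command with its result field. */
struct io_request {
    enum io_result result;
};

/*
 * Old behavior (modeled): on success the result field is left untouched,
 * relying on it having been initialized to success at submission.
 */
static void complete_io_old(struct io_request *req, int hw_ok)
{
    assert(req != NULL);
    if (!hw_ok)
        req->result = IO_ERROR;
}

/*
 * Fixed behavior (modeled): the result field is always written, so a
 * "timed out" value placed there by the kernel is replaced on success.
 */
static void complete_io_fixed(struct io_request *req, int hw_ok)
{
    assert(req != NULL);
    req->result = hw_ok ? IO_SUCCESS : IO_ERROR;
}
```

With a request whose result was overwritten to `IO_TIMED_OUT` before completion, `complete_io_old(&req, 1)` leaves the stale value in place, whereas `complete_io_fixed(&req, 1)` sets it to `IO_SUCCESS`.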

    Reviewed-by: Dave Carroll <david.carroll@microsemi.com>
    Reviewed-by: Scott Teel <scott.teel@microsemi.com>
    Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
    Signed-off-by: Don Brace <don.brace@microsemi.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
smartpqi_init.c 200 KB