    nvme: fix ns removal hang when failing to revalidate due to a transient error · 205da243
    Sagi Grimberg authored
    If a controller reset is racing with a namespace revalidation, the
    revalidation (admin) I/O will surely fail, but we should not remove the
    namespace as we will execute the I/O when the controller is back up.
    The same applies to spurious allocation failures (which return -ENOMEM).
    
    Fix this by checking the specific error code in nvme_revalidate_disk. If
    it is a transient error (for example, an NVMe status without the DNR bit
    set, or -ENOMEM from an allocation failure), do not remove the namespace:
    either it will recover when the controller is back up and schedules
    a subsequent scan, or the controller is going away and the namespaces
    will be removed anyway.
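
    The check described above can be modeled in userspace roughly as
    follows. This is only a sketch, not the patch itself: the helper names
    revalidate_error_is_transient() and should_remove_namespace() are
    hypothetical, and it assumes the usual convention that a negative value
    is an errno, a positive value is an NVMe status, and the DNR (Do Not
    Retry) bit is 0x4000 as in the Linux NVMe headers.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumed value of the "Do Not Retry" status bit (as in the Linux NVMe headers). */
#define NVME_SC_DNR 0x4000

/*
 * Hypothetical helper: classify a revalidation result.
 *   ret == 0  success
 *   ret <  0  negative errno (e.g. -ENOMEM)
 *   ret >  0  NVMe status code, possibly with NVME_SC_DNR set
 */
static bool revalidate_error_is_transient(int ret)
{
	/* A failed allocation says nothing about the namespace itself. */
	if (ret == -ENOMEM)
		return true;

	/* An NVMe status without DNR may succeed once the controller is back. */
	if (ret > 0 && !(ret & NVME_SC_DNR))
		return true;

	return false;
}

/* Hypothetical helper: remove the namespace only on a non-transient failure. */
static bool should_remove_namespace(int ret)
{
	return ret != 0 && !revalidate_error_is_transient(ret);
}
```

    Under these assumptions, -ENOMEM and a non-DNR status leave the
    namespace in place, while -ENODEV or a status with the DNR bit set
    still lead to removal.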
    
    This fixes a hang when namespace scanning races with a controller reset,
    and also spurious I/O errors in path failover conditions where the
    controller reset races with the namespace scan work with multipath
    enabled.
    
    Reported-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Reviewed-by: James Smart <james.smart@broadcom.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>