    nvme-fc: reject reconnect if io queue count is reduced to zero · 834d3710
    James Smart authored
    If:
    
     - A successful connect has occurred with an io queue count greater than
       zero, and namespaces have been detected and are running.
     - An error occurs which terminates the prior association and starts a
       reconnect.
     - The reconnect then creates a new controller, but for whatever reason,
       nvme_set_queue_count() results in the io queue count being set to
       zero.  This skips the io queue and tag set changes.
     - But... the controller still transitions to live, calling
       nvme_start_ctrl(), which calls nvme_start_queues(), which then
       releases I/Os into the transport, which then sends them to the driver.
    
    As there are no queues, things eventually hit the driver looking for a
    handle, which was cleared when the original controller was reset, and it
    can't proceed. Worst case, things progress, but everything fails.
    
    In the failing scenario, the nvme_set_features(NVME_FEAT_NUM_QUEUES)
    command actually failed with an NVME_SC_INTERNAL error.  For some reason,
    although nvme_set_queue_count() sees the error and sets the io queue
    count to zero, it doesn't return a failure status to the transport,
    which allows the transport to continue using the controller.
    
    Fix the problem by simply rejecting the new association if at least 1
    I/O queue can't be created. The association reject will fail the
    reconnect attempt and fall into the reconnect retry policy.
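
The check described above can be sketched in plain userspace C. This is only a simplified model under stated assumptions, not the code in fc.c: `struct model_ctrl`, `model_set_queue_count()`, `model_recreate_io_queues()` and `MODEL_SC_INTERNAL` are hypothetical names, and the choice of `-ENOSPC` as the reject code is an assumption made for illustration.

```c
#include <errno.h>

/* Hypothetical stand-in for a controller error status (assumption). */
#define MODEL_SC_INTERNAL 0x6

/* Hypothetical stand-in for controller state; not the kernel's struct. */
struct model_ctrl {
    unsigned int queue_count;   /* admin queue + I/O queues of the prior association */
};

/*
 * Models the behaviour the commit message describes: if the Set Features
 * command completes with a controller error status (> 0), the I/O queue
 * count is forced to zero, yet 0 (success) is still returned.
 */
static int model_set_queue_count(int cmd_status, unsigned int granted,
                                 unsigned int *nr_io_queues)
{
    if (cmd_status < 0)
        return cmd_status;      /* transport-level failure is propagated */
    if (cmd_status > 0) {
        *nr_io_queues = 0;      /* controller error: count zeroed, no failure */
        return 0;
    }
    if (granted < *nr_io_queues)
        *nr_io_queues = granted;
    return 0;
}

/*
 * Reconnect path with the reject added: if the prior association had I/O
 * queues and the new one ends up with none, fail the reconnect so that the
 * caller falls back to its retry policy.
 */
static int model_recreate_io_queues(struct model_ctrl *ctrl, int cmd_status,
                                    unsigned int requested,
                                    unsigned int granted)
{
    unsigned int prior_ioq_cnt = ctrl->queue_count - 1;
    unsigned int nr_io_queues = requested;
    int ret;

    ret = model_set_queue_count(cmd_status, granted, &nr_io_queues);
    if (ret)
        return ret;

    if (!nr_io_queues && prior_ioq_cnt)
        return -ENOSPC;         /* assumed error code: reject the association */

    ctrl->queue_count = nr_io_queues + 1;
    return 0;
}
```

In this model, a controller that previously had four I/O queues and then sees `MODEL_SC_INTERNAL` on reconnect gets a negative return and keeps its old queue count, while a controller that never had I/O queues is still allowed to come up with none.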

    Signed-off-by: James Smart <jsmart2021@gmail.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
fc.c 91.1 KB