    nvme: stop aer posting if controller state not live · cd48282c
    James Smart authored
    If an nvme async_event command completes, in most cases, a new
    async event is posted. However, if the controller enters a
    resetting or reconnecting state, there is nothing to block the
    scheduled work element from posting the async event again. Nor are
    there calls from the transport to stop async events when an
    association dies.
    
    In the case of FC, where the association is torn down, the aer must
    be aborted on the FC link and completes through the normal job
    completion path. Thus the terminated async event ends up being
    rescheduled even though the controller isn't in a valid state for
    the aer, and the reposting gets the transport into a partially torn
    down data structure.
    
    It's possible to hit the scenario on rdma, although it is much less
    likely: an aer would have to complete right as the association is
    terminated, and the association teardown reclaims the blk requests
    via nvme_cancel_request(), so it's immediate, not a link-related
    action like on FC.
    
    Fix by putting controller state checks in both the async event
    completion routine, where it schedules the async event, and in the
    async event work routine, before it calls into the transport. It's
    effectively a "stop_async_events()" behavior. The transport, when
    it creates a new association with the subsystem, will transition
    the state back to live and is already restarting the async event
    posting.
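    
    The two checks described above can be sketched as a small userspace
    model. This is not the kernel code: the enum, struct, and function
    names below are hypothetical stand-ins chosen for illustration, and
    the queued work element is reduced to a boolean flag. It only shows
    the shape of the guard: the completion routine does not reschedule
    unless the controller is live, and the work routine re-checks the
    state before calling into the transport.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the controller states named in the text. */
enum ctrl_state { CTRL_LIVE, CTRL_RESETTING, CTRL_RECONNECTING };

/* Hypothetical, reduced controller: not the kernel's struct nvme_ctrl. */
struct ctrl {
    enum ctrl_state state;
    bool work_queued;   /* models the scheduled async event work element */
    int aers_posted;    /* counts calls into the (hypothetical) transport */
};

/* Hypothetical transport hook: only safe while the association exists. */
void transport_submit_async_event(struct ctrl *c)
{
    assert(c->state == CTRL_LIVE);
    c->aers_posted++;
}

/* Work routine: re-check the state before calling into the transport. */
void async_event_work(struct ctrl *c)
{
    c->work_queued = false;
    if (c->state != CTRL_LIVE)
        return;         /* association is going away: do not post */
    transport_submit_async_event(c);
}

/* Completion routine: only schedule a new async event while live. */
void complete_async_event(struct ctrl *c)
{
    if (c->state != CTRL_LIVE)
        return;         /* acts like a "stop_async_events()" */
    c->work_queued = true;
}
```

    In this model, a completion that arrives while the state is
    resetting or reconnecting leaves work_queued false, and a work item
    that was already queued before the state change returns without
    touching the transport.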
    
    Signed-off-by: James Smart <james.smart@broadcom.com>
    [hch: remove taking a lock over reading the controller state]
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>