    sfc: Fix reset vs probe/remove/PM races involving efx_nic::state · 7153f623
    Ben Hutchings authored
    We try to defer resets while the device is not READY, but we're not
    doing this quite correctly.  In particular, changes to efx_nic::state
    are documented as serialised by the RTNL lock, but they aren't.
    
    1. We check whether a reset was requested during probe (suggesting
    broken hardware) before we allow requested resets to be scheduled.
    This leaves a window where a requested reset would be deferred
    indefinitely.
    
    2. Although we cancel the reset work item during device removal,
    there are still later operations that can cause it to be scheduled
    again.  We need to check the state before scheduling it.
    
    3. Since the state can change between scheduling and running of
    the work item, we still need to check it there, and we need to
    do so *after* acquiring the RTNL lock which serialises state
    changes.
    
    4. We must cancel the reset work item during device removal, if the
    state could ever have been READY.  This wasn't done in some of the
    failure paths from efx_pci_probe().  Move the cancellation to
    efx_pci_remove_main().
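
Points 2 to 4 can be illustrated with a small userspace sketch. This is not the driver's code: `struct nic`, `schedule_reset()`, `reset_work()`, `remove_main()` and the `state_lock` mutex (standing in for the RTNL lock) are hypothetical names chosen for illustration, and the work queue is modelled by a single `reset_pending` flag. It is exercised single-threaded, so it shows only the order of the checks, not real concurrency.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not the driver's real definitions. */
enum nic_state { STATE_UNINIT, STATE_READY, STATE_DISABLED };

struct nic {
    enum nic_state state;  /* changed only with state_lock held */
    bool reset_pending;    /* models a queued reset work item */
    int resets_done;       /* counts resets that actually ran */
};

/* Stands in for the RTNL lock that serialises state changes. */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

/* Point 2: check the state before scheduling the work item. */
static bool schedule_reset(struct nic *nic)
{
    if (nic->state != STATE_READY)
        return false;      /* not READY: do not queue a reset */
    nic->reset_pending = true;
    return true;
}

/* Point 3: the state may have changed since scheduling, so check it
 * again, and only after taking the lock that serialises state changes. */
static void reset_work(struct nic *nic)
{
    if (!nic->reset_pending)
        return;
    nic->reset_pending = false;

    pthread_mutex_lock(&state_lock);
    if (nic->state == STATE_READY)
        nic->resets_done++;  /* the real reset would happen here */
    pthread_mutex_unlock(&state_lock);
}

/* Point 4: removal leaves READY under the lock, then cancels any work
 * item that was queued while the state was still READY. */
static void remove_main(struct nic *nic)
{
    pthread_mutex_lock(&state_lock);
    nic->state = STATE_UNINIT;
    pthread_mutex_unlock(&state_lock);

    nic->reset_pending = false;  /* models cancelling the work item */
}
```

Because the state leaves READY before the cancellation, a later `schedule_reset()` call is refused instead of re-queuing the work item. In the kernel the cancellation step would presumably be a call such as `cancel_work_sync()` rather than clearing a flag.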
    
    Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
efx.c 75.7 KB