    [SCSI] fc transport: resolve scan vs delete deadlocks · a0785edf
    James Smart authored
    In a prior posting to linux-scsi on the fc transport and workq
    deadlocks, we noted a second error that did not have a patch:
      http://marc.theaimsgroup.com/?l=linux-scsi&m=114467847711383&w=2
      - There's a deadlock where scsi_remove_target() has to sit behind
        scsi_scan_target() due to contention over the scan_lock().
    
    Subsequently we posted a request for comments about the deadlock:
      http://marc.theaimsgroup.com/?l=linux-scsi&m=114469358829500&w=2
    
    This posting resolves the second error. Here's what we now understand,
    and are implementing:
    
      If the lldd deletes the rport while a scan is active, the sdev's queue
      is blocked, which stops the issuing of commands associated with the scan.
      At this point, the scan stalls, and does so with the shost->scan_mutex held.
      If any scan or delete request is then made on the host, it will stall
      waiting for the scan_mutex.
    
      For the FC transport, we queue all delete work to a single workq.
      So, things worked fine when competing with the scan, as long as the
      target blocking the scan was the same target at the top of our delete
      workq, as the delete workq routine always unblocked just prior to
      requesting the delete.  Unfortunately, if the top of our delete workq
      was for a different target, we deadlock.  Additionally, if the target
      blocking the scan returned, we were unblocking it in the scan workq
      routine, which won't actually execute until the existing stalled scan
      workq completes (i.e. we're re-scheduling it while it is in the midst
      of its execution).
    
      This patch moves the unblock out of the workq routines and moves it to
      the context that is scheduling the work. This ensures that at some point,
      we will unblock the target that is blocking scan.  Please note, however,
      that the deadlock condition may still occur while it waits for the
      transport to time out an unblock on a target.  Worst case, this is bounded
      by the transport dev_loss_tmo (default: 30 seconds).
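
      The ordering problem described above can be illustrated with a small
      user-space toy model. This is only a sketch, not kernel code: every name
      in it (toy_target, toy_host, toy_simulate, and so on) is hypothetical,
      the scan_mutex is reduced to "held while the scan's target is blocked",
      and the dev_loss_tmo wait is not modeled at all. It compares unblocking
      inside the delete work routine with unblocking in the context that
      queues the work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORK 8

/* Hypothetical stand-in for a target (rport/starget); not a kernel type. */
struct toy_target {
    bool blocked;   /* queue is blocked, so scan commands cannot be issued */
    bool deleted;
};

/* Hypothetical stand-in for a host with one FIFO delete workq. */
struct toy_host {
    struct toy_target *scan_waits_on;    /* target stalling the active scan */
    struct toy_target *workq[MAX_WORK];  /* single delete workq, FIFO order */
    size_t head, tail;
};

/* The stalled scan holds the mutex until the target it waits on is unblocked. */
static bool toy_scan_mutex_held(struct toy_host *h)
{
    if (h->scan_waits_on && !h->scan_waits_on->blocked)
        h->scan_waits_on = NULL;         /* scan finishes and drops the mutex */
    return h->scan_waits_on != NULL;
}

/* Queue a delete; optionally unblock in the scheduling context (new scheme). */
static void toy_queue_delete(struct toy_host *h, struct toy_target *t,
                             bool unblock_in_caller)
{
    assert(h->tail < MAX_WORK);
    if (unblock_in_caller)
        t->blocked = false;
    h->workq[h->tail++] = t;
}

/* Run the delete workq. Returns false if the head entry stalls on the mutex. */
static bool toy_run_delete_workq(struct toy_host *h, bool unblock_in_work)
{
    while (h->head < h->tail) {
        struct toy_target *t = h->workq[h->head];

        if (unblock_in_work)
            t->blocked = false;          /* old scheme: unblock just before delete */
        if (toy_scan_mutex_held(h))
            return false;                /* delete needs the mutex: deadlock */
        t->deleted = true;
        h->head++;
    }
    return true;
}

/*
 * Targets a and b are both blocked; the active scan is stalled on b.
 * blocker_first selects whether b or a sits at the top of the delete workq.
 * Returns true if both deletes complete, false if the workq stalls.
 */
static bool toy_simulate(bool unblock_in_caller, bool blocker_first)
{
    struct toy_target a = { .blocked = true, .deleted = false };
    struct toy_target b = { .blocked = true, .deleted = false };
    struct toy_host h = { .scan_waits_on = &b };

    toy_queue_delete(&h, blocker_first ? &b : &a, unblock_in_caller);
    toy_queue_delete(&h, blocker_first ? &a : &b, unblock_in_caller);
    return toy_run_delete_workq(&h, !unblock_in_caller) && a.deleted && b.deleted;
}
```

      In this model, unblocking in the work routine only completes when the
      target stalling the scan happens to be at the top of the delete workq,
      while unblocking in the scheduling context completes in either order.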
    
    Finally, Michael Reed deserves the credit for the bulk of this patch,
    its analysis, and its testing. Thank you for your help.
    
    Note: The request for comments statements about the gross-ness of the
      scan_mutex still stand.
    Signed-off-by: Michael Reed <mdr@sgi.com>
    Signed-off-by: James Smart <james.smart@emulex.com>
    Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
scsi_transport_fc.c 63.1 KB