1. 19 Aug, 2010 2 commits
    • firewire: sbp2: fix stall with "Unsolicited response" · a481e97d
      Stefan Richter authored
      Fix I/O stalls with some 4-bay RAID enclosures which are based on
      OXUF936QSE:
        - Onnto dataTale RSM4QO, old firmware (not anymore with current
          firmware),
        - inXtron Hydra Super-S LCM, old as well as current firmware
      when used in RAID-5 mode, perhaps also in other RAID modes.
      
      The stalls happen during heavy or moderate disk traffic in periods that
      are a multiple of 5 minutes, roughly twice per hour.  They are caused
      by the target responding too late to an ORB_Pointer register write:
      The target responds after Split_Timeout, hence firewire-core cancels
      the transaction, and firewire-sbp2 fails the SCSI request.  The SCSI
      core retries the request, which fails again (and again), hence SCSI core
      calls firewire-sbp2's abort handler (and even the Management_Agent
      register write in the abort handler has the transaction timeout
      problem).
      
      During all that, the process which issued the I/O is stalled in I/O
      wait state.
      
      Meanwhile, the target actually acts on the first failed SCSI request:
      It responds to the ORB_Pointer write later (seen in the kernel log as
      "firewire_core: Unsolicited response") and also finishes the SCSI
      request with proper status (seen in the kernel log as "firewire_sbp2:
      status write for unknown orb").
      
      So let's just ignore RCODE_CANCELLED in the transaction callback and
      wait for the target to complete the ORB nevertheless.  This requires
      a small modification in sbp2_cancel_orbs(); it now needs to call
      orb->callback() regardless of whether fw_cancel_transaction() found the
      transaction unfinished or finished.
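
      The behaviour described above can be sketched as a small user-space
      model.  This is an illustration only, not the driver code: the
      model_* names, the struct layout and the numeric rcode values are
      assumptions made up for this sketch.  It shows a cancelled
      ORB_Pointer write leaving the ORB pending so that a late status
      write from the target can still complete it, while a real send
      error fails the request immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative result codes; the numeric values are placeholders. */
enum model_rcode {
    MODEL_RCODE_UNSET = -1,
    MODEL_RCODE_COMPLETE,
    MODEL_RCODE_SEND_ERROR,
    MODEL_RCODE_CANCELLED
};

/* Hypothetical, heavily reduced stand-in for an ORB. */
struct model_orb {
    int rcode;     /* result recorded for this ORB */
    bool pending;  /* still waiting for the target's status write */
    int failed;    /* how often the request was failed back to SCSI */
};

static void model_orb_init(struct model_orb *orb)
{
    assert(orb != NULL);
    orb->rcode = MODEL_RCODE_UNSET;
    orb->pending = true;
    orb->failed = 0;
}

/* Transaction callback for the ORB_Pointer write. */
static void model_complete_transaction(struct model_orb *orb, int rcode)
{
    assert(orb != NULL);

    /* A cancelled (timed-out) write is ignored: the target may still
     * act on the ORB, so keep it pending. */
    if (rcode == MODEL_RCODE_CANCELLED)
        return;

    if (orb->rcode == MODEL_RCODE_UNSET)
        orb->rcode = rcode;

    /* Any other failure ends the request right away. */
    if (orb->rcode != MODEL_RCODE_COMPLETE && orb->pending) {
        orb->pending = false;
        orb->failed++;
    }
}

/* Status write from the target; returns false for an unknown ORB. */
static bool model_status_write(struct model_orb *orb)
{
    assert(orb != NULL);
    if (!orb->pending)
        return false;
    orb->pending = false;
    orb->rcode = MODEL_RCODE_COMPLETE;
    return true;
}
```

      In this model, calling model_complete_transaction() with
      MODEL_RCODE_CANCELLED and then model_status_write() completes the
      ORB without ever failing it, which is the outcome the fix aims for.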
      
      A different solution is to increase Split_Timeout on the local node.
      (Tested: 2000ms timeout; maybe 1000ms or something like that works too.
      200ms is insufficient.  Standard is 100ms.)  However, I would rather not do
      this because any software on any node could change the Split_Timeout to
      something unsuitable.  Or such a large Split_Timeout may be undesirable
      for other purposes.
      Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
    • firewire: sbp2: fix memory leak in sbp2_cancel_orbs or at send error · 6c74340b
      Stefan Richter authored
      When an ORB was canceled (Command ORB i.e. SCSI request timed out, or
      Management ORB timed out), or there was a send error in the initial
      transaction, we failed to drop one of the ORB's references and thus
      leaked memory.
      
      Background:
      In total, we hold 3 references to each Operation Request Block:
        - 1 during sbp2_scsi_queuecommand() or sbp2_send_management_orb()
          respectively,
        - 1 for the duration of the write transaction to the ORB_Pointer or
          Management_Agent register of the target,
        - 1 for as long as the ORB stays within the lu->orb_list, until
          the ORB is unlinked from the list and the orb->callback was
          executed.
      
      The last of these 3 references is dropped
        - normally by sbp2_status_write() when the target wrote status
          for a pending ORB,
        - or by sbp2_cancel_orbs() in case of an ORB time-out,
        - or by complete_transaction() in case of a send error.
      Of them, the latter two lacked the kref_put.
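
      The reference accounting above can be illustrated with a small
      user-space model of the time-out path.  This is a sketch under
      assumptions, not the driver code: model_get(), model_put() and
      model_cancel_path() are hypothetical helpers standing in for the
      kernel's kref operations, and the ORB is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted ORB. */
struct model_orb {
    int refs;    /* number of references currently held */
    bool freed;  /* set once the last reference is dropped */
};

static void model_get(struct model_orb *orb)
{
    assert(!orb->freed);
    orb->refs++;
}

static void model_put(struct model_orb *orb)
{
    assert(orb->refs > 0);
    if (--orb->refs == 0)
        orb->freed = true;
}

/*
 * Walk one ORB through the time-out path.  The parameter selects
 * whether the cancel step drops the ORB callback reference (the
 * behaviour with the fix) or forgets it (the leak).  Returns true
 * if the ORB was freed at the end.
 */
static bool model_cancel_path(bool put_orb_callback_ref)
{
    struct model_orb orb = { .refs = 1, .freed = false }; /* submitter */

    model_get(&orb);  /* transaction callback reference */
    model_get(&orb);  /* ORB list / ORB callback reference */

    model_put(&orb);  /* transaction callback has run */

    /* Time-out: the ORB is unlinked and its callback is executed. */
    if (put_orb_callback_ref)
        model_put(&orb);

    model_put(&orb);  /* submitter is done with the ORB */

    return orb.freed;
}
```

      With the put in place the count reaches zero and the ORB is
      freed; without it one reference remains and the memory is leaked.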
      
      Add the missing kref_put()s.  Add comments to the gets and puts of
      references for transaction callbacks and ORB callbacks so that it is
      easier to see what is supposed to happen.
      Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
  2. 05 Aug, 2010 1 commit
  3. 02 Aug, 2010 2 commits
  4. 01 Aug, 2010 2 commits
  5. 31 Jul, 2010 5 commits
  6. 30 Jul, 2010 8 commits
  7. 29 Jul, 2010 20 commits