Commit a481e97d authored by Stefan Richter

firewire: sbp2: fix stall with "Unsolicited response"

Fix I/O stalls with some 4-bay RAID enclosures which are based on
OXUF936QSE:
  - Onnto dataTale RSM4QO, old firmware (not anymore with current
    firmware),
  - inXtron Hydra Super-S LCM, old as well as current firmware
when used in RAID-5 mode, perhaps also in other RAID modes.

The stalls happen during heavy or moderate disk traffic in periods that
are a multiple of 5 minutes, roughly twice per hour.  They are caused
by the target responding too late to an ORB_Pointer register write:
The target responds after Split_Timeout, hence firewire-core cancels
the transaction, and firewire-sbp2 fails the SCSI request.  The SCSI
core retries the request, which fails again (and again), hence the SCSI core
calls firewire-sbp2's abort handler (and even the Management_Agent
register write in the abort handler has the transaction timeout
problem).

During all that, the process which issued the I/O is stalled in I/O
wait state.

Meanwhile, the target actually acts on the first failed SCSI request:
It responds to the ORB_Pointer write later (seen in the kernel log as
"firewire_core: Unsolicited response") and also finishes the SCSI
request with proper status (seen in the kernel log as "firewire_sbp2:
status write for unknown orb").

So let's just ignore RCODE_CANCELLED in the transaction callback and
wait for the target to complete the ORB nevertheless.  This requires
a small modification in sbp2_cancel_orbs(); it now needs to call
orb->callback() regardless of whether fw_cancel_transaction() found the
transaction unfinished or finished.
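
The decision described above can be sketched in isolation as follows.  This is a minimal user-space model, not the driver code: struct mock_orb and transaction_needs_cleanup() are hypothetical names, and the RCODE_* values are illustrative stand-ins for the definitions in the kernel headers.

```c
#include <stdbool.h>

/* Illustrative values only; the real definitions live in the kernel headers. */
enum {
	RCODE_COMPLETE   = 0x00,
	RCODE_TYPE_ERROR = 0x06,
	RCODE_CANCELLED  = 0x11,
};

/* Hypothetical stand-in for the driver's ORB; rcode == -1 means "not set yet". */
struct mock_orb {
	int rcode;
};

/*
 * Models the transaction callback's decision: record the first rcode seen,
 * then report whether the ORB has to be torn down right away.
 * RCODE_CANCELLED is treated like RCODE_COMPLETE, so the ORB stays queued
 * and a late status write from the target can still complete it.
 */
static bool transaction_needs_cleanup(struct mock_orb *orb, int rcode)
{
	if (orb->rcode == -1)
		orb->rcode = rcode;

	return orb->rcode != RCODE_COMPLETE && orb->rcode != RCODE_CANCELLED;
}
```

With this model, a timed-out (cancelled) ORB_Pointer write leaves the ORB in place, while a genuine error code still triggers immediate cleanup.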

A different solution is to increase Split_Timeout on the local node.
(Tested: 2000ms timeout; maybe 1000ms or something like that works too.
200ms is insufficient.  Standard is 100ms.)  However, I would rather not
do this because any software on any node could change the Split_Timeout to
something unsuitable.  Or such a large Split_Timeout may be undesirable
for other purposes.
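
For reference, the timeout values mentioned above can be related to register contents with a small sketch.  It assumes the usual IEEE 1394 layout as I understand it (SPLIT_TIMEOUT_HI holds whole seconds in its low 3 bits, SPLIT_TIMEOUT_LO holds 1/8000 s cycles in its upper 13 bits); the struct and helper names are hypothetical and not part of firewire-core.

```c
#include <stdint.h>

#define CYCLES_PER_SECOND 8000u	/* one cycle = 125 us */

/* Hypothetical container for the two Split_Timeout register values. */
struct split_timeout_regs {
	uint32_t hi;	/* seconds, low 3 bits (assumed layout) */
	uint32_t lo;	/* cycles, upper 13 bits (assumed layout) */
};

/* Convert milliseconds to register values; inputs are clamped to 7999 ms. */
static struct split_timeout_regs split_timeout_from_ms(uint32_t ms)
{
	struct split_timeout_regs r;
	uint32_t cycles;

	if (ms > 7999u)
		ms = 7999u;
	cycles = ms * CYCLES_PER_SECOND / 1000u;

	r.hi = cycles / CYCLES_PER_SECOND;
	r.lo = (cycles % CYCLES_PER_SECOND) << 19;
	return r;
}

/* Convert register values back to milliseconds. */
static uint32_t split_timeout_to_ms(struct split_timeout_regs r)
{
	uint32_t cycles = (r.hi & 0x7u) * CYCLES_PER_SECOND + (r.lo >> 19);

	return cycles * 1000u / CYCLES_PER_SECOND;
}
```

Under these assumptions the standard 100ms corresponds to 800 cycles in the low register, and the tested 2000ms to 2 in the high register with no fractional cycles.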

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
parent 6c74340b
@@ -472,12 +472,18 @@ static void complete_transaction(struct fw_card *card, int rcode,
 	 * So this callback only sets the rcode if it hasn't already
 	 * been set and only does the cleanup if the transaction
 	 * failed and we didn't already get a status write.
+	 *
+	 * Here we treat RCODE_CANCELLED like RCODE_COMPLETE because some
+	 * OXUF936QSE firmwares occasionally respond after Split_Timeout and
+	 * complete the ORB just fine. Note, we also get RCODE_CANCELLED
+	 * from sbp2_cancel_orbs() if fw_cancel_transaction() == 0.
 	 */
 	spin_lock_irqsave(&card->lock, flags);
 	if (orb->rcode == -1)
 		orb->rcode = rcode;
-	if (orb->rcode != RCODE_COMPLETE) {
+	if (orb->rcode != RCODE_COMPLETE && orb->rcode != RCODE_CANCELLED) {
 		list_del(&orb->link);
 		spin_unlock_irqrestore(&card->lock, flags);
@@ -526,8 +532,7 @@ static int sbp2_cancel_orbs(struct sbp2_logical_unit *lu)
 	list_for_each_entry_safe(orb, next, &list, link) {
 		retval = 0;
-		if (fw_cancel_transaction(device->card, &orb->t) == 0)
-			continue;
+		fw_cancel_transaction(device->card, &orb->t);
 		orb->rcode = RCODE_CANCELLED;
 		orb->callback(orb, NULL);