1. 11 Apr, 2014 4 commits
    • RDMA/cxgb4: SQ flush fix · b4e2901c
      Steve Wise authored
      There is a race when moving a QP from RTS->CLOSING where a SQ work
      request could be posted after the FW receives the RDMA_RI/FINI WR.
      The SQ work request will never get processed, and should be completed
      with FLUSHED status.  c4iw_flush_sq(), however, was dropping the
      oldest SQ work request when in CLOSING or IDLE states, instead of
      completing the pending work request. If that oldest pending work
      request was actually complete and had a CQE in the CQ, then when that
      CQE is processed in poll_cq, we'll BUG_ON() due to the inconsistent
      SQ/CQ state.
      
      This is a very small timing hole and has only been hit once so far.
      
      The fix is two-fold:
      
      1) c4iw_flush_sq() MUST always flush all non-completed WRs with FLUSHED
         status regardless of the QP state.
      
      2) In c4iw_modify_rc_qp(), always set the "in error" bit on the queue
         before moving the state out of RTS.  This ensures that the state
         transition will not happen while another thread is in
         post_rc_send(), because set_state() and post_rc_send() both acquire
         the qp spinlock.  Also, once we transition the state out of RTS,
         subsequent calls to post_rc_send() will fail because the "in error"
         bit is set.  I don't think this fully closes the race where the FW
         can get a FINI followed by an SQ work request being posted (because
         they are posted to different EQs), but the #1 fix will handle the
         issue by flushing the SQ work request.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
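
      The two-part fix described in this commit can be sketched with a small
      user-space model.  This is only an illustration under stated
      assumptions: struct model_qp, the model_* functions and the WR_*
      status values are hypothetical names, not the driver's code, and the
      qp spinlock that the real functions take is left out.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical, simplified model; none of these types are the driver's. */
      enum qp_state { QP_IDLE, QP_RTS, QP_CLOSING, QP_ERROR };
      enum wr_status { WR_PENDING, WR_SUCCESS, WR_FLUSHED };

      #define SQ_DEPTH 8

      struct model_qp {
          enum qp_state state;
          bool in_error;                /* models the "in error" bit on the queue */
          enum wr_status sq[SQ_DEPTH];  /* status of each posted SQ work request */
          size_t sq_count;
      };

      /* Posting fails once the "in error" bit is set (or the model SQ is full). */
      static int model_post_send(struct model_qp *qp)
      {
          if (qp->in_error || qp->sq_count == SQ_DEPTH)
              return -1;
          qp->sq[qp->sq_count++] = WR_PENDING;
          return 0;
      }

      /* Fix 2: set the "in error" bit before moving the state out of RTS. */
      static void model_modify_qp(struct model_qp *qp, enum qp_state next)
      {
          if (qp->state == QP_RTS && next != QP_RTS)
              qp->in_error = true;
          qp->state = next;
      }

      /* Fix 1: flush every non-completed WR, regardless of the QP state. */
      static size_t model_flush_sq(struct model_qp *qp)
      {
          size_t flushed = 0;

          assert(qp->sq_count <= SQ_DEPTH);
          for (size_t i = 0; i < qp->sq_count; i++) {
              if (qp->sq[i] == WR_PENDING) {
                  qp->sq[i] = WR_FLUSHED;
                  flushed++;
              }
          }
          return flushed;
      }
      ```

      In this model a work request that already completed keeps its status,
      while anything still pending ends up FLUSHED, so the SQ and CQ views
      stay consistent whichever state the QP is in when the flush runs.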
    • RDMA/cxgb4: rmb() after reading valid gen bit · def4771f
      Steve Wise authored
      Some HW platforms can reorder read operations, so we must rmb() after
      we see a valid gen bit in a CQE but before we read any other fields
      from the CQE.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
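
      The ordering requirement in this commit can be illustrated in
      portable C11.  This is a sketch, not the driver's code: struct
      model_cqe and the model_* functions are hypothetical, and
      atomic_thread_fence(memory_order_acquire) stands in for the kernel's
      rmb(), which is not available in user space.

      ```c
      #include <assert.h>
      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Hypothetical CQE layout for illustration; not the hardware format. */
      struct model_cqe {
          uint32_t qpid;
          uint32_t status;
          _Atomic uint8_t gen;  /* generation bit, written last by the producer */
      };

      /* Producer side: fill the payload, then publish it by writing the gen bit. */
      static void model_write_cqe(struct model_cqe *cqe, uint32_t qpid,
                                  uint32_t status, uint8_t gen)
      {
          cqe->qpid = qpid;
          cqe->status = status;
          atomic_store_explicit(&cqe->gen, gen, memory_order_release);
      }

      /* Consumer side: returns true and copies the fields if the CQE is valid. */
      static bool model_poll_cqe(struct model_cqe *cqe, uint8_t expected_gen,
                                 uint32_t *qpid, uint32_t *status)
      {
          assert(cqe != NULL && qpid != NULL && status != NULL);

          if (atomic_load_explicit(&cqe->gen, memory_order_relaxed) != expected_gen)
              return false;

          /*
           * Stand-in for rmb(): the reads below must not be satisfied
           * before the gen bit read above.
           */
          atomic_thread_fence(memory_order_acquire);

          *qpid = cqe->qpid;
          *status = cqe->status;
          return true;
      }
      ```

      Without the fence, a platform that reorders reads could return stale
      qpid or status values from an entry whose gen bit already looks valid.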
    • RDMA/cxgb4: Endpoint timeout fixes · b33bd0cb
      Steve Wise authored
      1) Timed-out endpoint processing can be starved if CPL messages flow
         continually into the driver.  This condition exposed the other bugs
         below.
      
      Solution: In process_work(), call process_timedout_eps() after each CPL
      is processed.
      
      2) Connection events can be processed even though the endpoint is on
         the timeout list.  If the endpoint is scheduled for timeout
         processing, then we must ignore MPA Start Requests and Replies.
      
      Solution: Change stop_ep_timer() to return 1 if the ep has already been
      queued for timeout processing.  All the callers of stop_ep_timer() need
      to check this and act accordingly.  There are just a few cases where
      the caller needs to do something different if stop_ep_timer() returns 1:
      
      1) in process_mpa_reply(), ignore the reply and process_timeout()
         will abort the connection.
      
      2) in process_mpa_request(), ignore the request and process_timeout()
         will abort the connection.
      
      It is ok for callers of stop_ep_timer() to abort the connection since
      that will leave the state in ABORTING or DEAD, and process_timeout()
      now ignores timeouts when the ep is in these states.
      
      3) Double insertion on the timeout list.  Since the endpoint timers
         are used for connection setup and teardown, we need to guard
         against the possibility that an endpoint is already on the timeout
         list.  This is a rare condition and only seen under heavy load and
         in the presence of the two bugs above.
      
      Solution: In ep_timeout(), don't queue the endpoint if it is already on
      the queue.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
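
      The three solutions in this commit can be modelled in a few lines of
      user-space C.  This is a rough sketch under assumptions: struct
      model_ep, the model_* functions and the single-threaded list are
      hypothetical simplifications, and the real driver's locking, timers
      and CPL handlers are not shown.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical, simplified endpoint model; not the driver's structures. */
      enum ep_state { EP_CONNECTING, EP_ABORTING, EP_DEAD };

      struct model_ep {
          enum ep_state state;
          bool on_timeout_list;
          struct model_ep *next;
      };

      static struct model_ep *timeout_list;  /* eps queued for timeout processing */

      /* Timer callback: queue the ep unless it is already queued (fix 3). */
      static void model_ep_timeout(struct model_ep *ep)
      {
          assert(ep != NULL);
          if (ep->on_timeout_list)
              return;
          ep->on_timeout_list = true;
          ep->next = timeout_list;
          timeout_list = ep;
      }

      /* Returns 1 if the ep is already queued for timeout processing (fix 2). */
      static int model_stop_ep_timer(struct model_ep *ep)
      {
          return ep->on_timeout_list ? 1 : 0;
      }

      /* Drain the list; timeouts are ignored for ABORTING or DEAD eps. */
      static size_t model_process_timedout_eps(void)
      {
          size_t aborted = 0;

          while (timeout_list) {
              struct model_ep *ep = timeout_list;

              timeout_list = ep->next;
              ep->next = NULL;
              ep->on_timeout_list = false;
              if (ep->state != EP_ABORTING && ep->state != EP_DEAD) {
                  ep->state = EP_ABORTING;
                  aborted++;
              }
          }
          return aborted;
      }

      /* An MPA request or reply is ignored when the ep has already timed out. */
      static bool model_process_mpa(struct model_ep *ep)
      {
          if (model_stop_ep_timer(ep))
              return false;  /* timeout processing will abort the connection */
          return true;
      }

      /* Work loop: run timeout processing after every message (fix 1). */
      static size_t model_process_work(struct model_ep **targets, size_t n)
      {
          size_t handled = 0;

          for (size_t i = 0; i < n; i++) {
              if (model_process_mpa(targets[i]))
                  handled++;
              model_process_timedout_eps();
          }
          return handled;
      }
      ```

      Because the timeout list is drained inside the loop, a steady stream
      of messages cannot keep a timed-out endpoint waiting indefinitely in
      this model.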
    • RDMA/cxgb4: Use the BAR2/WC path for kernel QPs and T5 devices · fa658a98
      Steve Wise authored
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      
      [ Fix cast from u64* to integer.  - Roland ]
      Signed-off-by: Roland Dreier <roland@purestorage.com>
  2. 03 Apr, 2014 36 commits