1. 15 Jul, 2008 40 commits
    • Eli Cohen's avatar
      IPoIB: Double default RX/TX ring sizes · bc3a290b
      Eli Cohen authored
      Increase IPoIB ring sizes to twice their original sizes (RX: 128->256,
      TX: 64->128) to act as a shock absorber for high traffic peaks.  With
      the current settings, we have seen cases where there are many calls to
      netif_stop_queue(), which causes degradation in throughput.  Also,
      larger receive buffer sizes help IPoIB in CM mode to avoid experiencing
      RNR NAK conditions due to insufficient receive buffers at the SRQ.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      bc3a290b
    • Eli Cohen's avatar
      IPoIB/cm: Reduce connected mode TX object size · e112373f
      Eli Cohen authored
      Since IPoIB connected mode does not support NETIF_F_SG, we only have one DMA
      mapping per send, so we don't need a mapping[] array.  Define a new
      struct with a single u64 mapping member and use it for the CM tx_ring.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      e112373f
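The saving described above can be sketched in userspace C. This is only an illustration: `MAX_FRAGS_ASSUMED`, `struct tx_buf_sg`, `struct tx_buf_cm` and `tx_ring_bytes_saved()` are hypothetical names, and the real driver structures are laid out differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed fragment limit, for illustration only; the kernel's value
 * depends on its configuration. */
#define MAX_FRAGS_ASSUMED 18

/* Hypothetical descriptor for a scatter/gather-capable ring: one DMA
 * address per fragment plus one for the linear part of the packet. */
struct tx_buf_sg {
	void     *skb;
	uint64_t  mapping[MAX_FRAGS_ASSUMED + 1];
};

/* Hypothetical descriptor for a ring that never uses scatter/gather:
 * a single DMA address is enough. */
struct tx_buf_cm {
	void     *skb;
	uint64_t  mapping;
};

/* Bytes saved across a ring of ring_size entries. */
static size_t tx_ring_bytes_saved(size_t ring_size)
{
	return ring_size * (sizeof(struct tx_buf_sg) - sizeof(struct tx_buf_cm));
}
```

Under these assumptions every ring entry shrinks by `MAX_FRAGS_ASSUMED` 64-bit words, so the saving grows linearly with the ring size.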
    • Ralph Campbell's avatar
      IB/ipath: Use IEEE OUI for vendor_id reported by ibv_query_device() · df866619
      Ralph Campbell authored
      The IB spec for SubnGet(NodeInfo) and query HCA says that the vendor
      ID field should be the IEEE OUI assigned to the vendor.  The ipath
      driver was returning the PCI vendor ID instead.  This will affect
      applications which call ibv_query_device().  The old value was
      0x001fc1 or 0x001077, the new value is 0x001175.
      
      The vendor ID doesn't appear to be exported via /sys so that should
      reduce possible compatibility issues.  I'm only aware of Open MPI as a
      major application which depends on this change, and they have made
      necessary adjustments.
      Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      df866619
    • Eli Cohen's avatar
      IPoIB: Use dev_set_mtu() to change mtu · bd360671
      Eli Cohen authored
      When the driver sets the MTU of the net device outside of its
      change_mtu method, it should make use of dev_set_mtu() instead of
      directly setting the mtu field of struct net_device.  Otherwise
      functions registered to be called upon MTU change will not get called
      (this is done through call_netdevice_notifiers() in dev_set_mtu()).
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      bd360671
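The difference between assigning the field and going through a setter that notifies listeners can be shown with a small userspace model. Everything here (`struct netdev_sketch`, `set_mtu_direct()`, `set_mtu_notify()`) is a hypothetical stand-in; the real `dev_set_mtu()` does more validation and uses a notifier chain rather than a single callback.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a network device. */
struct netdev_sketch {
	unsigned int mtu;
	void (*mtu_notifier)(struct netdev_sketch *dev); /* stand-in for a notifier chain */
	unsigned int notifications;
};

/* Example listener: just counts how often it was told about a change. */
static void count_notification(struct netdev_sketch *dev)
{
	dev->notifications++;
}

/* Direct assignment: the field changes but nobody is told. */
static void set_mtu_direct(struct netdev_sketch *dev, unsigned int mtu)
{
	dev->mtu = mtu;
}

/* Setter in the spirit of dev_set_mtu(): update, then notify listeners. */
static void set_mtu_notify(struct netdev_sketch *dev, unsigned int mtu)
{
	if (dev->mtu == mtu)
		return;
	dev->mtu = mtu;
	if (dev->mtu_notifier != NULL)
		dev->mtu_notifier(dev);
}
```

With `count_notification` installed as the listener, only the second setter increments the counter, which is the behaviour the commit relies on.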
    • Eli Cohen's avatar
      IPoIB: Use rtnl lock/unlock when changing device flags · c8c2afe3
      Eli Cohen authored
      Use of this lock is required to synchronize changes to the netdevice's
      data structs.  Also move the call to ipoib_flush_paths() after the
      modification of the netdevice flags in set_mode().
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      c8c2afe3
    • Roland Dreier's avatar
      IPoIB: Get rid of ipoib_mcast_detach() wrapper · 9eae554c
      Roland Dreier authored
      ipoib_mcast_detach() does nothing except call ib_detach_mcast(), so just
      use the core API in the one place that does a multicast group detach.
      
      add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-105 (-105)
      function                                     old     new   delta
      ipoib_mcast_leave                            357     319     -38
      ipoib_mcast_detach                            67       -     -67
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      9eae554c
    • Eli Cohen's avatar
      IPoIB: Only set Q_Key once: after joining broadcast group · d0de1362
      Eli Cohen authored
      The current code will set the Q_Key for any join of a non-sendonly
      multicast group.  The operation involves a modify QP operation, which
      is fairly heavyweight, and is only really required after the join of
      the broadcast group.  Fix this by adding a parameter to ipoib_mcast_attach()
      to control when the Q_Key is set.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      d0de1362
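A rough model of the idea, assuming a hypothetical `mcast_attach_sketch()` whose extra boolean decides whether the expensive "modify QP" step runs. The struct and its counters are invented for illustration and do not mirror the driver's data structures.

```c
#include <stdbool.h>

/* Hypothetical QP model: counts how many heavyweight modify operations ran. */
struct qp_sketch {
	unsigned int qkey;
	unsigned int modify_calls;
	unsigned int attached_groups;
};

/* Attach to a multicast group; only program the Q_Key when asked to. */
static void mcast_attach_sketch(struct qp_sketch *qp, unsigned int qkey,
				bool set_qkey)
{
	if (set_qkey) {
		qp->qkey = qkey;       /* stand-in for the modify QP operation */
		qp->modify_calls++;
	}
	qp->attached_groups++;
}
```

A caller would pass `true` only for the broadcast group join and `false` for every later group, so the modify step runs once regardless of how many groups are joined.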
    • Eli Cohen's avatar
      IPoIB: Remove priv->mcast_mutex · 5892eff9
      Eli Cohen authored
      No need for a mutex around calls to ib_attach_mcast/ib_detach_mcast
      since these operations are synchronized at the HW driver layer.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      5892eff9
    • Eli Cohen's avatar
      IPoIB: Remove unused IPOIB_MCAST_STARTED code · c03d4731
      Eli Cohen authored
      The IPOIB_MCAST_STARTED flag is not used at all since commit b3e2749b
      ("IPoIB: Don't drop multicast sends when they can be queued"), so
      remove it.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      c03d4731
    • Steve Wise's avatar
    • Roland Dreier's avatar
      RDMA/nes: Get rid of ring_doorbell parameter of nes_post_cqp_request() · 8294f297
      Roland Dreier authored
      Every caller of nes_post_cqp_request() passed it NES_CQP_REQUEST_RING_DOORBELL,
      so just remove that parameter and always ring the doorbell.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      Acked-by: Faisal Latif <flatif@neteffect.com>
      8294f297
    • Jon Mason's avatar
      RDMA/cxgb3: Propagate HW page size capabilities · 52c8084b
      Jon Mason authored
      cxgb3 does not currently report the page size capabilities, and
      incorrectly reports them internally.
      
      This version changes the bit-shifting to a static value (per Steve's
      request).
      Signed-off-by: Jon Mason <jon@opengridcomputing.com>
      Acked-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      52c8084b
    • Roland Dreier's avatar
      RDMA/nes: Encapsulate logic nes_put_cqp_request() · 1ff66e8c
      Roland Dreier authored
      The iw_nes driver repeats the logic
      
      	if (atomic_dec_and_test(&cqp_request->refcount)) {
      		if (cqp_request->dynamic) {
      			kfree(cqp_request);
      		} else {
      			spin_lock_irqsave(&nesdev->cqp.lock, flags);
      			list_add_tail(&cqp_request->list, &nesdev->cqp_avail_reqs);
      			spin_unlock_irqrestore(&nesdev->cqp.lock, flags);
      		}
      	}
      
      over and over.  Wrap this up in functions nes_free_cqp_request() and
      nes_put_cqp_request() to simplify such code.
      
      In addition to making the source smaller and more readable, this shrinks
      the compiled code quite a bit:
      
      add/remove: 2/0 grow/shrink: 0/13 up/down: 164/-1692 (-1528)
      function                                     old     new   delta
      nes_free_cqp_request                           -     147    +147
      nes_put_cqp_request                            -      17     +17
      nes_modify_qp                               2316    2293     -23
      nes_hw_modify_qp                             737     657     -80
      nes_dereg_mr                                 945     860     -85
      flush_wqes                                   501     416     -85
      nes_manage_apbvt                             648     560     -88
      nes_reg_mr                                  1117    1026     -91
      nes_cqp_ce_handler                           927     769    -158
      nes_alloc_mw                                1052     884    -168
      nes_create_qp                               5314    5141    -173
      nes_alloc_fmr                               2212    2035    -177
      nes_destroy_cq                              1097     918    -179
      nes_create_cq                               2787    2598    -189
      nes_dealloc_mw                               762     566    -196
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      Acked-by: Faisal Latif <flatif@neteffect.com>
      1ff66e8c
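The refactoring pattern in this commit, one "free" helper plus one "put" helper that drops a reference, can be sketched in portable C11. This is not the driver code: `request_sketch`, `pool_sketch`, `request_free()` and `request_put()` are hypothetical, locking is omitted, and the free list here is a simple push-to-head stack rather than the `list_add_tail()` queue in the original.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct request_sketch {
	atomic_int refcount;
	bool dynamic;                 /* heap-allocated rather than taken from the pool */
	struct request_sketch *next;  /* link used while on the free list */
};

struct pool_sketch {
	struct request_sketch *avail; /* free list; locking omitted in this sketch */
};

/* Release a request whose last reference is gone. */
static void request_free(struct pool_sketch *pool, struct request_sketch *req)
{
	if (req->dynamic) {
		free(req);
	} else {
		req->next = pool->avail;
		pool->avail = req;
	}
}

/* Drop one reference; free the request when it was the last one. */
static void request_put(struct pool_sketch *pool, struct request_sketch *req)
{
	/* atomic_fetch_sub returns the previous value, so 1 means this
	 * call dropped the final reference. */
	if (atomic_fetch_sub(&req->refcount, 1) == 1)
		request_free(pool, req);
}
```

Each call site then shrinks to a single `request_put()` call, which is where the code-size reduction reported above comes from.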
    • Moni Shoua's avatar
      IPoIB: Refresh paths instead of flushing them on SM change events · ee1e2c82
      Moni Shoua authored
      The patch tries to solve the problem of the device going down and paths being
      flushed on an SM change event. The method is to mark the paths as candidates for
      refresh (by setting the new valid flag to 0), and wait for an ARP
      probe to trigger a new path record query.
      
      The solution requires a different and less intrusive handling of SM
      change event. For that, the second argument of the flush function
      changes its meaning from a boolean flag to a level.  In most cases, SM
      failover doesn't cause LID change so traffic won't stop.  In the rare
      cases of LID change, the remote host (the one that hadn't changed its
      LID) will lose connectivity until paths are refreshed. This is no
      worse than the current state.  In fact, preventing the device from
      going down saves packets that otherwise would be lost.
      Signed-off-by: Moni Levy <monil@voltaire.com>
      Signed-off-by: Moni Shoua <monis@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      ee1e2c82
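The "level instead of boolean" idea can be modelled as follows. The names (`flush_level_sketch`, `path_sketch`, `flush_paths()`, `path_needs_query()`) and the two levels are assumptions made for this sketch; the driver's actual levels and path structure are not shown in the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flush levels: a light flush keeps paths, a full flush drops them. */
enum flush_level_sketch { FLUSH_LIGHT_SKETCH, FLUSH_FULL_SKETCH };

struct path_sketch {
	bool present; /* path entry still exists */
	bool valid;   /* false means "refresh before trusting" */
};

static void flush_paths(struct path_sketch *paths, size_t n,
			enum flush_level_sketch level)
{
	for (size_t i = 0; i < n; i++) {
		if (level == FLUSH_LIGHT_SKETCH)
			paths[i].valid = false;   /* keep the entry, mark it for refresh */
		else
			paths[i].present = false; /* drop the entry entirely */
	}
}

/* True when the caller should issue a new path record query. */
static bool path_needs_query(const struct path_sketch *p)
{
	return !p->present || !p->valid;
}
```

After a light flush the entries remain in place, so traffic can keep using them until the refresh completes, which is the less intrusive behaviour the commit describes.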
    • Joachim Fenkes's avatar
      IB/ehca: Make device table externally visible · 038919f2
      Joachim Fenkes authored
      This gives ehca an autogenerated modalias and therefore enables automatic loading.
      Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      038919f2
    • Vladimir Sokolovsky's avatar
      IPoIB: add LRO support · af40da89
      Vladimir Sokolovsky authored
      Add "ipoib_use_lro" module parameter to enable LRO and an
      "ipoib_lro_max_aggr" module parameter to set the max number of packets
      to be aggregated.  Make LRO controllable and LRO statistics accessible
      through ethtool.
      Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      af40da89
    • Ron Livne's avatar
      IPoIB: Use multicast loopback blocking if available · 12406734
      Ron Livne authored
      Set IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK for IPoIB's UD QPs if
      supported by the underlying device.  This creates an improvement of up
      to 39% in bandwidth when sending multicast packets with IPoIB, and an
      improvement of 12% in CPU usage.
      Signed-off-by: Ron Livne <ronli@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      12406734
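"Set the flag if the device supports it" amounts to a capability check before QP creation. A minimal sketch, with placeholder bit values and a hypothetical helper `ud_qp_create_flags()`; the kernel's real constants and creation path are not reproduced here.

```c
#include <stdint.h>

/* Bit values are placeholders chosen for this sketch, not the kernel's. */
#define CAP_BLOCK_MCAST_LOOPBACK_SKETCH    (1u << 0)
#define CREATE_BLOCK_MCAST_LOOPBACK_SKETCH (1u << 1)

/* Return the creation flags for a UD QP, adding loopback blocking only
 * when the device advertises the capability. */
static uint32_t ud_qp_create_flags(uint32_t device_cap_flags,
				   uint32_t base_flags)
{
	if (device_cap_flags & CAP_BLOCK_MCAST_LOOPBACK_SKETCH)
		base_flags |= CREATE_BLOCK_MCAST_LOOPBACK_SKETCH;
	return base_flags;
}
```

Gating on the capability keeps QP creation working unchanged on devices that lack the feature.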
    • Ron Livne's avatar
      IB/mlx4: Add support for blocking multicast loopback packets · 521e575b
      Ron Livne authored
      Add support for handling the IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK
      flag by using the per-multicast group loopback blocking feature of
      mlx4 hardware.
      Signed-off-by: Ron Livne <ronli@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      521e575b
    • Ron Livne's avatar
      IB/core: Add support for multicast loopback blocking · 47ee1b9f
      Ron Livne authored
      This patch adds a creation flag for QPs,
      IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK, which when set means that
      multicast sends from the QP to a group that the QP is attached to will
      not be looped back to the QP's receive queue.  This can be used to
      save receive resources when a consumer does not want a local copy of
      multicast traffic; for example IPoIB must waste CPU time throwing away
      such local copies of multicast traffic.
      
      This patch also adds a device capability flag that shows whether a
      device supports this feature or not.
      Signed-off-by: Ron Livne <ronli@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      47ee1b9f
    • Steve Wise's avatar
      RDMA/cxgb3: Add support for protocol statistics · 14cc180f
      Steve Wise authored
      - Add a new rdma ctl command called RDMA_GET_MIB to the cxgb3 low
        level driver to obtain the protocol mib from the rnic hardware.
      
      - Add new iw_cxgb3 provider method to get the MIB from the low level
        driver.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      14cc180f
    • Steve Wise's avatar
      RDMA/core: Add iWARP protocol statistics attributes in sysfs · 7f624d02
      Steve Wise authored
      This patch adds a sysfs attribute group called "proto_stats" under
      /sys/class/infiniband/$device/ and populates this group with protocol
      statistics if they exist for a given device.  Currently, only iWARP
      stats are defined, but the code is designed to allow InfiniBand
      protocol stats if they become available.  These stats are per-device
      and more importantly -not- per port.
      
      Details:
      
      - Add union rdma_protocol_stats in ib_verbs.h.  This union allows
        defining transport-specific stats.  Currently only iwarp stats are
        defined.
      
      - Add struct iw_protocol_stats to define the current set of iwarp
        protocol stats.
      
      - Add new ib_device method called get_proto_stats() to return protocol
        statistics.
      
      - Add logic in core/sysfs.c to create iwarp protocol stats attributes
        if the device is an RNIC and has a get_proto_stats() method.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      7f624d02
    • Roland Dreier's avatar
      IPoIB/cm: Fix racy use of receive WR/SGL in ipoib_cm_post_receive_nonsrq() · a7d834c4
      Roland Dreier authored
      For devices that don't support SRQs, ipoib_cm_post_receive_nonsrq() is
      called from both ipoib_cm_handle_rx_wc() and ipoib_cm_nonsrq_init_rx(),
      and these two callers are not synchronized against each other.
      However, ipoib_cm_post_receive_nonsrq() always reuses the same receive
      work request and scatter list structures, so multiple callers can end
      up stepping on each other, which leads to posting garbled work
      requests.
      
      Fix this by having the caller pass in the ib_recv_wr and ib_sge
      structures to use, and allocating new local structures in
      ipoib_cm_nonsrq_init_rx().
      
      Based on a patch by Pradeep Satyanarayana <pradeep@us.ibm.com> and
      David Wilder <dwilder@us.ibm.com>, with debugging help from Hoang-Nam
      Nguyen <hnguyen@de.ibm.com>.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      a7d834c4
    • Roland Dreier's avatar
      468f2239
    • Roland Dreier's avatar
      RDMA/cxgb3: Remove write-only iwch_rnic_attributes fields · eec8845d
      Roland Dreier authored
      The members struct iwch_rnic_attributes.vendor_id and .vendor_part_id
      are write-only, so we might as well get rid of them.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      Acked-by: Steve Wise <swise@opengridcomputing.com>
      eec8845d
    • Steve Wise's avatar
      RDMA/cxgb3: Fix up some ib_device_attr fields · 97d1cc80
      Steve Wise authored
      - set fw_ver
      - set hw_ver
      - set max_qp_wr to something reasonable
      - set max_cqe to something reasonable
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      97d1cc80
    • Stefan Roscher's avatar
      IB/ehca: In case of lost interrupts, trigger EOI to reenable interrupts · 6f7bc01a
      Stefan Roscher authored
      During corner case testing, we noticed that some versions of ehca do
      not properly transition to interrupt done in special load situations.
      This can be resolved by periodically triggering EOI through H_EOI, if
      EQEs are pending.
      Signed-off-by: Stefan Roscher <stefan.roscher@de.ibm.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      6f7bc01a
    • Joachim Fenkes's avatar
    • Roland Dreier's avatar
      IB/mlx4: Remove extra code for RESET->ERR QP state transition · 7c27f358
      Roland Dreier authored
      Commit 65adfa91 ("IB/mlx4: Fix RESET to RESET and RESET to ERROR
      transitions") added some extra code to handle a QP state transition
      from RESET to ERROR.  However, the latest 1.2.1 version of the IB spec
      has clarified that this transition is actually not allowed, so we can
      remove this extra code again.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      7c27f358
    • Roland Dreier's avatar
      IB/mthca: Remove extra code for RESET->ERR QP state transition · d3809ad0
      Roland Dreier authored
      Commit b18aad71 ("IB/mthca: Fix RESET to ERROR transition") added some
      extra code to handle a QP state transition from RESET to ERROR.
      However, the latest 1.2.1 version of the IB spec has clarified that
      this transition is actually not allowed, so we can remove this extra
      code again.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      d3809ad0
    • Ralph Campbell's avatar
      IB/core: Reset to error QP state transition is not allowed · e5a5e7d5
      Ralph Campbell authored
      I was reviewing the QP state transition diagram in the IB 1.2.1 spec
      and the code for qp_state_table[], and noticed that the code allows a
      QP to be modified from IB_QPS_RESET to IB_QPS_ERR whereas the notes
      for figure 124 (pg 457) specifically say that this transition isn't
      allowed.  This is a clarification from earlier versions of the IB
      spec, which were ambiguous in this area and suggested that the RESET
      to ERR transition was allowed.
      
      Fix up the qp_state_table[] to make RESET->ERR not allowed.
      Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      e5a5e7d5
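A table-driven transition check of the kind this commit adjusts might look like the sketch below. Only the RESET row reflects what the commit states (RESET to ERR is rejected); the remaining rows are an assumed reading of the usual QP state machine and should be checked against the spec. All identifiers are hypothetical, and the real `qp_state_table[]` also carries attribute masks that are left out here.

```c
#include <stdbool.h>

enum qp_state_sketch {
	QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS, QPS_SQD, QPS_SQE, QPS_ERR,
	QPS_COUNT
};

/* allowed[cur][next]; entries not listed default to false. */
static const bool allowed[QPS_COUNT][QPS_COUNT] = {
	[QPS_RESET] = { [QPS_RESET] = true, [QPS_INIT] = true },
	[QPS_INIT]  = { [QPS_RESET] = true, [QPS_INIT] = true,
			[QPS_RTR] = true, [QPS_ERR] = true },
	[QPS_RTR]   = { [QPS_RESET] = true, [QPS_RTS] = true,
			[QPS_ERR] = true },
	[QPS_RTS]   = { [QPS_RESET] = true, [QPS_RTS] = true,
			[QPS_SQD] = true, [QPS_ERR] = true },
	[QPS_SQD]   = { [QPS_RESET] = true, [QPS_RTS] = true,
			[QPS_SQD] = true, [QPS_ERR] = true },
	[QPS_SQE]   = { [QPS_RESET] = true, [QPS_RTS] = true,
			[QPS_ERR] = true },
	[QPS_ERR]   = { [QPS_RESET] = true, [QPS_ERR] = true },
};

/* Look up whether moving from cur to next is permitted. */
static bool qp_transition_allowed(enum qp_state_sketch cur,
				  enum qp_state_sketch next)
{
	if ((unsigned int)cur >= QPS_COUNT || (unsigned int)next >= QPS_COUNT)
		return false;
	return allowed[cur][next];
}
```

Because missing entries default to `false`, disallowing a transition is just a matter of removing its entry from the table.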
    • Eli Cohen's avatar
      IB/mlx4: Pass congestion management class MADs to the HCA · 6578cf33
      Eli Cohen authored
      ConnectX HCAs support the IB_MGMT_CLASS_CONG_MGMT management class, so
      process MADs of this class through the MAD_IFC firmware command.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      6578cf33
    • Eli Cohen's avatar
      IB/mlx4: Configure QPs' max message size based on real device capability · d1f2cd89
      Eli Cohen authored
      ConnectX returns the max message size it supports through the
      QUERY_DEV_CAP firmware command.  When modifying a QP to RTR, the max
      message size for the QP must be specified.  This value must not exceed
      the value declared through QUERY_DEV_CAP.  The current code ignores
      the max allowed size and unconditionally sets the value to 2^31.  This
      patch sets all QPs to the max value allowed as returned from firmware.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      d1f2cd89
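As a sketch of the change, the value programmed into the QP can be derived from the device-reported maximum instead of a fixed exponent of 31. The helpers `ilog2_sketch()` and `qp_log_max_msg()` are hypothetical, and this assumes the QP context stores the limit as a base-2 logarithm, which the commit message implies but does not spell out.

```c
#include <stdint.h>

/* Floor of log2(v) for v > 0; returns 0 for v == 0 in this sketch. */
static unsigned int ilog2_sketch(uint32_t v)
{
	unsigned int l = 0;

	while (v >>= 1)
		l++;
	return l;
}

/* Exponent to program for a QP, given the maximum message size (in
 * bytes) that the device reported. */
static unsigned int qp_log_max_msg(uint32_t dev_max_msg_sz)
{
	return ilog2_sketch(dev_max_msg_sz);
}
```

For a device that reports 2^30 bytes this yields 30 rather than the previously hard-coded 31, so the QP never claims more than the device allows.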
    • Steve Wise's avatar
      RDMA/cxgb3: MEM_MGT_EXTENSIONS support · e7e55829
      Steve Wise authored
      - set IB_DEVICE_MEM_MGT_EXTENSIONS capability bit if fw supports it.
      - set max_fast_reg_page_list_len device attribute.
      - add iwch_alloc_fast_reg_mr function.
      - add iwch_alloc_fastreg_pbl
      - add iwch_free_fastreg_pbl
      - adjust the WQ depth for kernel mode work queues to account for
        fastreg possibly taking 2 WR slots.
      - add fastreg_mr work request support.
      - add local_inv work request support.
      - add send_with_inv and send_with_se_inv work request support.
      - remove useless duplicate enums/defines for TPT/MW/MR stuff.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      e7e55829
    • Steve Wise's avatar
      RDMA/core: Add memory management extensions support · 00f7ec36
      Steve Wise authored
      This patch adds support for the IB "base memory management extension"
      (BMME) and the equivalent iWARP operations (which the iWARP verbs
      mandate that all devices implement).  The new operations are:
      
       - Allocate an ib_mr for use in fast register work requests.
      
       - Allocate/free physical buffer lists for use in fast register work
         requests.  This allows device drivers to allocate this memory as
         needed for use in posting send requests (eg via dma_alloc_coherent).
      
       - New send queue work requests:
         * send with remote invalidate
         * fast register memory region
         * local invalidate memory region
         * RDMA read with invalidate local memory region (iWARP only)
      
      Consumer interface details:
      
       - A new device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS is added
         to indicate device support for these features.
      
       - New send work request opcodes IB_WR_FAST_REG_MR, IB_WR_LOCAL_INV,
         IB_WR_RDMA_READ_WITH_INV are added.
      
       - A new consumer API function, ib_alloc_mr() is added to allocate
         fast register memory regions.
      
       - New consumer API functions, ib_alloc_fast_reg_page_list() and
         ib_free_fast_reg_page_list() are added to allocate and free
         device-specific memory for fast registration page lists.
      
       - A new consumer API function, ib_update_fast_reg_key(), is added to
         allow the key portion of the R_Key and L_Key of a fast registration
         MR to be updated.  Consumers call this if desired before posting
         an IB_WR_FAST_REG_MR work request.
      
      Consumers can use this as follows:
      
       - MR is allocated with ib_alloc_mr().
      
       - Page list memory is allocated with ib_alloc_fast_reg_page_list().
      
       - MR R_Key/L_Key "key" field is updated with ib_update_fast_reg_key().
      
       - MR made VALID and bound to a specific page list via
         ib_post_send(IB_WR_FAST_REG_MR)
      
       - MR made INVALID via ib_post_send(IB_WR_LOCAL_INV),
         ib_post_send(IB_WR_RDMA_READ_WITH_INV) or an incoming send with
         invalidate operation.
      
       - MR is deallocated with ib_dereg_mr()
      
       - page lists dealloced via ib_free_fast_reg_page_list().
      
      Applications can allocate a fast register MR once, and then can
      repeatedly bind the MR to different physical block lists (PBLs) via
      posting work requests to a send queue (SQ).  For each outstanding
      MR-to-PBL binding in the SQ pipe, a fast_reg_page_list needs to be
      allocated (the fast_reg_page_list is owned by the low-level driver
      from the consumer posting a work request until the request completes).
      Thus pipelining can be achieved while still allowing device-specific
      page_list processing.
      
      The 32-bit fast register memory key/STag is composed of a 24-bit index
      and an 8-bit key.  The application can change the key each time it
      fast registers thus allowing more control over the peer's use of the
      key/STag (ie it can effectively be changed each time the rkey is
      rebound to a page list).
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      00f7ec36
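The 24-bit index / 8-bit key split can be illustrated with a small helper in the spirit of `ib_update_fast_reg_key()`. This sketch assumes the key occupies the low 8 bits of the 32-bit value, which the commit message does not state explicitly; `struct mr_sketch` and `update_fast_reg_key_sketch()` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical memory region holding only the two keys. */
struct mr_sketch {
	uint32_t lkey;
	uint32_t rkey;
};

/* Replace the 8-bit key portion while preserving the 24-bit index.
 * Assumption: the key lives in the low byte. */
static void update_fast_reg_key_sketch(struct mr_sketch *mr, uint8_t newkey)
{
	mr->lkey = (mr->lkey & 0xffffff00u) | newkey;
	mr->rkey = (mr->rkey & 0xffffff00u) | newkey;
}
```

A consumer would typically bump the key before each fast register request, so that a peer still holding the previous R_Key can no longer use it once the MR is rebound.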
    • Eli Cohen's avatar
      IPoIB: Copy small received SKBs in connected mode · f89271da
      Eli Cohen authored
      The connected mode implementation in the IPoIB driver has a large
      overhead in the way SKBs are handled in the receive flow.  It usually
      allocates an SKB as big as the currently received SKB
      and moves unused fragments from the old SKB to the new one. This
      involves a loop on all the remaining fragments and incurs overhead on
      the CPU.  This patch, for small SKBs, allocates an SKB just large
      enough to contain the received data and copies to it the data from the
      received SKB.  The newly allocated SKB is passed to the stack and the
      old SKB is reposted.
      
      When running netperf, UDP small messages, without this patch I get:
      
          UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
          14.4.3.178 (14.4.3.178) port 0 AF_INET
          Socket  Message  Elapsed      Messages
          Size    Size     Time         Okay Errors   Throughput
          bytes   bytes    secs            #      #   10^6bits/sec
      
          114688     128   10.00     5142034      0     526.31
          114688           10.00     1130489            115.71
      
      With this patch I get both send and receive at ~315 Mbps.
      
      The reason that send performance actually slows down is as follows:
      When using this patch, the overhead of the CPU for handling RX packets
      is dramatically reduced.  As a result, we do not experience RNR NAK
      messages from the receiver which cause the connection to be closed and
      reopened again; when the patch is not used, the receiver cannot handle
      the packets fast enough so there is less time to post new buffers and
      hence the mentioned RNR NAKs.  So what happens is that the
      application *thinks* it posted a certain number of packets for
      transmission but these packets are flushed and do not really get
      transmitted.  Since the connection gets opened and closed many times,
      each time netperf gets the CPU time that otherwise would have been
      given to IPoIB to actually transmit the packets.  This can be verified
      when looking at the port counters -- the output of ifconfig and the
      output of netperf (this is for the case without the patch):
      
          tx packets
          ==========
          port counter:   1,543,996
          ifconfig:       1,581,426
          netperf:        5,142,034
      
          rx packets
          ==========
          netperf:        1,130,489
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      f89271da
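The copy-versus-hand-off decision can be sketched with plain buffers. The cutoff `COPY_THRESHOLD_SKETCH`, the `struct rx_result` type and `rx_deliver()` are all assumptions for illustration; the commit message does not give the driver's threshold, and real SKB handling involves fragments that this model ignores.

```c
#include <stdlib.h>
#include <string.h>

/* Assumed cutoff in bytes; the driver's actual value is not given above. */
#define COPY_THRESHOLD_SKETCH 256

struct rx_result {
	unsigned char *data;      /* buffer handed to the upper layer */
	size_t len;
	int ring_buffer_reusable; /* 1 if the original buffer can be reposted as-is */
};

/* Deliver a received packet: copy small ones into a right-sized buffer so
 * the large ring buffer can be reposted, hand off large ones directly. */
static struct rx_result rx_deliver(unsigned char *ring_buf, size_t len)
{
	struct rx_result r = { NULL, len, 0 };

	if (len < COPY_THRESHOLD_SKETCH) {
		unsigned char *copy = malloc(len ? len : 1);

		if (copy != NULL) {
			memcpy(copy, ring_buf, len);
			r.data = copy;
			r.ring_buffer_reusable = 1;
			return r;
		}
	}
	/* Large packet (or allocation failure): pass the ring buffer up. */
	r.data = ring_buf;
	return r;
}
```

For small packets the cost is one short `memcpy`, in exchange for skipping the allocation of a full-size replacement buffer and the fragment shuffling described above.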
    • Roland Dreier's avatar
      RDMA: Remove subversion $Id tags · f3781d2e
      Roland Dreier authored
      They don't get updated by git and so they're worse than useless.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      f3781d2e
    • Dotan Barak's avatar
      RDMA: Improve include file coding style · 4deccd6d
      Dotan Barak authored
      Remove subversion $Id lines and improve readability by fixing other
      coding style problems pointed out by checkpatch.pl.
      Signed-off-by: Dotan Barak <dotanba@gmail.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      4deccd6d
    • Robert P. J. Day's avatar
    • Eli Cohen's avatar
      IB/mlx4: Optimize QP stamping · 9670e553
      Eli Cohen authored
      The idea is that for QPs with fixed size work requests (eg selective
      signaling QPs), before stamping the WQE, we read the value of the DS
      field, which gives the effective size of the descriptor as used in the
      previous post.  Then we stamp only that area, since the rest of the
      descriptor is already stamped.
      
      When initializing the send queue buffer, make sure the DS field is
      initialized to the max descriptor size so that the subsequent stamping
      will be done on the entire descriptor area.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      9670e553
    • Moni Shoua's avatar
      IB/sa: Fail requests made while creating new SM AH · 164ba089
      Moni Shoua authored
      This patch solves a race that occurs after an event occurs that causes
      the SA query module to flush its SM address handle (AH).  When SM AH
      becomes invalid and needs an update it is handled by the global
      workqueue.  On the other hand this event is also handled in the IPoIB
      driver by queuing work in the ipoib_workqueue that does multicast
      joins.  Although queuing is in the right order, it is done to 2
      different workqueues and so there is no guarantee that the first to be
      queued is the first to be executed.
      
      This causes a problem because IPoIB may end up sending a request to
      the old SM, which will take a long time to time out (since the old SM
      is gone); this leads to a much longer than necessary interruption in
      multicast traffic.
      
      The patch sets the SA query module's SM AH to NULL when the event
      occurs, and until update_sm_ah() is done, any request that needs sm_ah
      fails with -EAGAIN return status.
      
      For consumers, the patch doesn't make things worse.  Before the patch,
      MADs are sent to the wrong SM so the request gets lost.  Consumers can
      be improved if they examine the return code and respond to EAGAIN
      properly but even without an improvement the situation is not getting
      worse.
      Signed-off-by: Moni Levy <monil@voltaire.com>
      Signed-off-by: Moni Shoua <monis@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      164ba089
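The "fail fast while the address handle is being rebuilt" behaviour can be modelled like this. The types and functions (`sa_port_sketch`, `sa_send_query()`, `sa_handle_sm_event()`, `sa_update_sm_ah()`) are hypothetical and single-threaded; the real module protects the pointer with locking and reference counts.

```c
#include <errno.h>
#include <stddef.h>

struct sm_ah_sketch { int placeholder; };

struct sa_port_sketch {
	struct sm_ah_sketch *sm_ah; /* NULL while a new handle is being created */
};

/* Event handler: invalidate the handle immediately. */
static void sa_handle_sm_event(struct sa_port_sketch *port)
{
	port->sm_ah = NULL;
}

/* Deferred work: install the freshly created handle. */
static void sa_update_sm_ah(struct sa_port_sketch *port,
			    struct sm_ah_sketch *ah)
{
	port->sm_ah = ah;
}

/* Any request that needs the handle fails with -EAGAIN until it exists. */
static int sa_send_query(struct sa_port_sketch *port)
{
	if (port->sm_ah == NULL)
		return -EAGAIN;
	return 0; /* stand-in for actually sending the MAD */
}
```

A consumer that checks for `-EAGAIN` can retry shortly afterwards instead of waiting for a request to the old SM to time out.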