1. 15 Mar, 2011 5 commits
    • RDMA/cma: Replace global lock in rdma_destroy_id() with id-specific one · a396d43a
      Sean Hefty authored
      rdma_destroy_id currently uses the global rdma cm 'lock' to test if an
      rdma_cm_id has been bound to a device.  This prevents an active
      address resolution callback handler from assigning a device to the
      rdma_cm_id after rdma_destroy_id checks for one.
      
      Instead, we can replace the use of the global lock around the check of
      the rdma_cm_id device pointer by setting the id state to destroying,
      then flushing all active callbacks.  The latter is accomplished by
      acquiring and releasing the handler_mutex.  Any active handler will
      complete first, and any newly scheduled handlers will find the
      rdma_cm_id in an invalid state.
      
      In addition to optimizing the current locking scheme, the use of the
      rdma_cm_id mutex is a more intuitive synchronization mechanism than
      that of the global lock.  These changes are based on feedback from
      Doug Ledford <dledford@redhat.com> while he was trying to debug a
      crash in the rdma cm destroy path.
      Signed-off-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
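The "mark as destroying, then flush" idea from the commit above can be sketched in userspace C with pthreads. This is only an illustration under stated assumptions: `struct sketch_id`, `sketch_handler()` and `sketch_destroy_begin()` are hypothetical names, not the kernel's, and a plain `state_lock` mutex stands in for whatever protects the id state in the real code.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_state { SKETCH_IDLE, SKETCH_DESTROYING };

/* Hypothetical stand-in for an rdma_cm_id; not the kernel structure. */
struct sketch_id {
    pthread_mutex_t state_lock;    /* protects 'state' */
    pthread_mutex_t handler_mutex; /* held while a callback handler runs */
    enum sketch_state state;
    void *device;                  /* assigned by a handler */
};

void sketch_init(struct sketch_id *id)
{
    pthread_mutex_init(&id->state_lock, NULL);
    pthread_mutex_init(&id->handler_mutex, NULL);
    id->state = SKETCH_IDLE;
    id->device = NULL;
}

/* A callback handler: binds a device unless the id is being destroyed.
 * Returns true if the handler did its work. */
bool sketch_handler(struct sketch_id *id, void *dev)
{
    bool ran = false;

    pthread_mutex_lock(&id->handler_mutex);
    pthread_mutex_lock(&id->state_lock);
    if (id->state != SKETCH_DESTROYING) {
        id->device = dev;
        ran = true;
    }
    pthread_mutex_unlock(&id->state_lock);
    pthread_mutex_unlock(&id->handler_mutex);
    return ran;
}

/* First half of destroy: invalidate the id, then flush active handlers. */
void *sketch_destroy_begin(struct sketch_id *id)
{
    pthread_mutex_lock(&id->state_lock);
    id->state = SKETCH_DESTROYING;
    pthread_mutex_unlock(&id->state_lock);

    /* Acquire and release: any handler already running finishes first. */
    pthread_mutex_lock(&id->handler_mutex);
    pthread_mutex_unlock(&id->handler_mutex);

    /* No handler can change 'device' any more, so no global lock is needed. */
    return id->device;
}
```

Once `sketch_destroy_begin()` returns, a late handler sees the destroying state and leaves the device pointer alone, which is why the check no longer needs a global lock in this sketch.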
    • IB/cm: Cancel pending LAP message when exiting IB_CM_ESTABLISH state · 8d8ac865
      Sean Hefty authored
      This problem was reported by Moni Shoua <monis@mellanox.com> and Amir
      Vadai <amirv@mellanox.com>:
      
      	When destroying a cm_id from a context of a work queue and if
      	the lap_state of this cm_id is IB_CM_LAP_SENT, we need to
      	release the reference of this id that was taken upon the send
      	of the LAP message.  Otherwise, if the expected APR message
      	gets lost, it is only after a long time that the reference
      	will be released, while during that the work handler thread is
      	not available to process other things.
      
      It turns out that we need to cancel any pending LAP messages whenever
      we transition out of the IB_CM_ESTABLISHED state.  This occurs when
      disconnecting - either sending or receiving a DREQ.  It can also
      happen in a corner case where we receive a REJ message after sending
      an RTU, followed by a LAP.  Add checks and cancel any outstanding LAP
      messages in these three cases.
      
      Canceling the LAP when sending a DREQ fixes the destroy problem
      reported by Moni.  When a cm_id is destroyed in the IB_CM_ESTABLISHED
      state, it sends a DREQ to the remote side to notify the peer that the
      connection is going away.
      Signed-off-by: Sean Hefty <sean.hefty@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
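A minimal sketch of the rule in the commit above: every transition out of the established state first cancels an outstanding LAP and drops the reference that was taken when it was sent. All names here (`struct sk_conn`, `sk_send_lap()`, `sk_leave_established()`) are hypothetical, and a plain integer stands in for the real reference counting and message cancellation.

```c
enum sk_state { SK_ESTABLISHED, SK_DREQ_SENT, SK_DREQ_RCVD, SK_IDLE };
enum sk_lap   { SK_LAP_NONE, SK_LAP_SENT };

/* Hypothetical connection object; not the kernel's cm_id. */
struct sk_conn {
    enum sk_state state;
    enum sk_lap lap_state;
    int refcount;
};

/* Sending a LAP takes a reference that lives until the reply or a cancel. */
void sk_send_lap(struct sk_conn *c)
{
    if (c->state == SK_ESTABLISHED && c->lap_state == SK_LAP_NONE) {
        c->lap_state = SK_LAP_SENT;
        c->refcount++;
    }
}

/* Cancel an outstanding LAP and release the reference it was holding. */
static void sk_cancel_lap(struct sk_conn *c)
{
    if (c->lap_state == SK_LAP_SENT) {
        c->lap_state = SK_LAP_NONE;
        c->refcount--;
    }
}

/* Single exit point used for "DREQ sent", "DREQ received" and "REJ". */
void sk_leave_established(struct sk_conn *c, enum sk_state next)
{
    if (c->state != SK_ESTABLISHED)
        return;
    sk_cancel_lap(c);
    c->state = next;
}
```

Routing all three cases through one exit function is a design choice of this sketch; the commit itself only says that checks were added in each of the three cases.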
    • IB/cm: Bump reference count on cm_id before invoking callback · 29963437
      Sean Hefty authored
      When processing a SIDR REQ, the ib_cm allocates a new cm_id.  The
      refcount of the cm_id is initialized to 1.  However, cm_process_work
      will decrement the refcount after invoking all callbacks.  The result
      is that the cm_id will end up with refcount set to 0 by the end of the
      sidr req handler.
      
      If a user tries to destroy the cm_id, the destruction will proceed,
      under the incorrect assumption that no other threads are referencing
      the cm_id.  This can lead to a crash when the cm callback thread tries
      to access the cm_id.
      
      This problem was noticed as part of a larger investigation with kernel
      crashes in the rdma_cm when running on a real time OS.
      Signed-off-by: Sean Hefty <sean.hefty@intel.com>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
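The reference-count arithmetic described above can be traced with a small model. This is a sketch, not the ib_cm code: `struct sk_cm_id`, `sk_process_work()` and `sk_sidr_req()` are hypothetical, the count is a plain int rather than an atomic, and the callbacks themselves are left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a cm_id with only its reference count. */
struct sk_cm_id {
    int refcount;
};

/* Models the work processing step: callbacks run, then one ref is dropped. */
static void sk_process_work(struct sk_cm_id *id)
{
    /* ... callbacks would be invoked here ... */
    id->refcount--;
}

/* Models handling a SIDR REQ; returns the count left when the handler ends.
 * 'bump' selects whether the extra reference is taken before the callback. */
int sk_sidr_req(bool bump)
{
    struct sk_cm_id id = { .refcount = 1 }; /* new id starts at 1 */

    if (bump)
        id.refcount++;                      /* reference for the callback path */
    sk_process_work(&id);
    return id.refcount;
}
```

Without the bump the model ends at 0 even though the user still owns the id; with the bump it ends at 1, so a later destroy can correctly tell whether anyone else still holds a reference.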
    • RDMA/cma: Fix crash in request handlers · 25ae21a1
      Sean Hefty authored
      Doug Ledford and Red Hat reported a crash when running the rdma_cm on
      a real-time OS.  The crash has the following call trace:
      
          cm_process_work
             cma_req_handler
                cma_disable_callback
                rdma_create_id
                   kzalloc
                   init_completion
                cma_get_net_info
                cma_save_net_info
                cma_any_addr
                   cma_zero_addr
                rdma_translate_ip
                   rdma_copy_addr
                cma_acquire_dev
                   rdma_addr_get_sgid
                   ib_find_cached_gid
                   cma_attach_to_dev
                ucma_event_handler
                   kzalloc
                   ib_copy_ah_attr_to_user
                cma_comp
      
      [ preempted ]
      
          cma_write
              copy_from_user
              ucma_destroy_id
                 copy_from_user
                 _ucma_find_context
                 ucma_put_ctx
                 ucma_free_ctx
                    rdma_destroy_id
                       cma_exch
                       cma_cancel_operation
                       rdma_node_get_transport
      
              rt_mutex_slowunlock
              bad_area_nosemaphore
              oops_enter
      
      They were able to reproduce the crash multiple times with the
      following details:
      
          Crash seems to always happen on the:
                  mutex_unlock(&conn_id->handler_mutex);
          as conn_id looks to have been freed during this code path.
      
      An examination of the code shows that a race exists in the request
      handlers.  When a new connection request is received, the rdma_cm
      allocates a new connection identifier.  This identifier has a single
      reference count on it.  If a user calls rdma_destroy_id() from another
      thread after receiving a callback, rdma_destroy_id will proceed to
      destroy the id and free the associated memory.  However, the request
      handlers may still be in the process of running.  When control returns
      to the request handlers, they can attempt to access the newly created
      identifiers.
      
      Fix this by holding a reference on the newly created rdma_cm_id until
      the request handler is through accessing it.
      Signed-off-by: Sean Hefty <sean.hefty@intel.com>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
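The fix described above, holding a reference until the request handler is done with the new id, can be illustrated as follows. This is a single-threaded sketch under assumptions: `struct sk_id`, `sk_put()` and `sk_req_handler()` are hypothetical, the count is not atomic, and the user's destroy call is modeled as a callback that drops the user's reference immediately.

```c
#include <assert.h>
#include <stdlib.h>

int sk_frees; /* counts how many ids have been freed */

/* Hypothetical stand-in for a newly created connection id. */
struct sk_id {
    int refcount;
    int events;
};

/* Drop one reference; free the id when the last one goes away. */
static void sk_put(struct sk_id *id)
{
    assert(id->refcount > 0);
    if (--id->refcount == 0) {
        free(id);
        sk_frees++;
    }
}

typedef void (*sk_cb)(struct sk_id *);

/* A user callback that destroys the id as soon as it is notified. */
void sk_destroy_in_callback(struct sk_id *id)
{
    sk_put(id); /* drops the user's reference */
}

/* Request handler: returns the number of frees seen right after the
 * callback returned, or -1 if allocation failed. */
int sk_req_handler(sk_cb cb)
{
    struct sk_id *id = calloc(1, sizeof(*id));
    int frees_after_cb;

    if (!id)
        return -1;
    id->refcount = 1; /* the user's reference */
    id->refcount++;   /* the handler's own reference */

    cb(id);           /* the user may destroy the id in here */

    frees_after_cb = sk_frees;
    id->events++;     /* still valid: the handler's reference pins the id */
    sk_put(id);       /* handler is through; the id may be freed now */
    return frees_after_cb;
}
```

Dropping the `id->refcount++` line would let the callback free the id, and the later `id->events++` would then touch freed memory, which mirrors the access to `conn_id->handler_mutex` in the reported crash.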
2. 14 Mar, 2011 10 commits
3. 23 Feb, 2011 1 commit
4. 18 Feb, 2011 8 commits
5. 17 Feb, 2011 10 commits
6. 16 Feb, 2011 6 commits