Commit 4542e3c7 authored by Håkon Bugge's avatar Håkon Bugge Committed by Doug Ledford

IB/mlx4: Fix CM REQ retries in paravirt mode

CM REQs cannot be successfully retried, because a new pv_cm_id is
created for each request, without checking if one already exists.

By checking if an id exists before creating one, the bug is fixed.

This bug can be provoked by running an RDMA CM user-land application,
but inserting a five seconds delay before the rdma_accept() call on
the passive side. This delay is larger than the default CMA timeout,
and triggers a retry from the active side. The retried REQ will use
another pv_cm_id (the cm_id on the wire). This confuses the CM
protocol and two REJs are sent from the passive side.

Here is an excerpt from ibdump running without the patch:

3.285092       LID: 4 -> LID: 4       SDP 290 CM: ConnectRequest(SDP Hello)
7.382711       LID: 4 -> LID: 4       SDP 290 CM: ConnectRequest(SDP Hello)
7.382861       LID: 4 -> LID: 4       InfiniBand 290 CM: ConnectReject
7.387644       LID: 4 -> LID: 4       InfiniBand 290 CM: ConnectReject

and here is the same with bug fix applied:

3.251010       LID: 4 -> LID: 4       SDP 290 CM: ConnectRequest(SDP Hello)
7.349387       LID: 4 -> LID: 4       SDP 290 CM: ConnectRequest(SDP Hello)
8.258443       LID: 4 -> LID: 4       SDP 290 CM: ConnectReply(SDP Hello)
8.259890       LID: 4 -> LID: 4       InfiniBand 290 CM: ReadyToUse
Suggested-by: default avatarVenkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: default avatarHåkon Bugge <haakon.bugge@oracle.com>
Reported-by: default avatarWei Lin Guay <wei.lin.guay@oracle.com>
Tested-by: default avatarWei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: default avatarYuval Shaia <yuval.shaia@oracle.com>
Acked-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
parent a25ce427
......@@ -323,6 +323,9 @@ int mlx4_ib_multiplex_cm_handler(struct ib_device *ibdev, int port, int slave_id
mad->mad_hdr.attr_id == CM_REP_ATTR_ID ||
mad->mad_hdr.attr_id == CM_SIDR_REQ_ATTR_ID) {
sl_cm_id = get_local_comm_id(mad);
id = id_map_get(ibdev, &pv_cm_id, slave_id, sl_cm_id);
if (id)
goto cont;
id = id_map_alloc(ibdev, slave_id, sl_cm_id);
if (IS_ERR(id)) {
mlx4_ib_warn(ibdev, "%s: id{slave: %d, sl_cm_id: 0x%x} Failed to id_map_alloc\n",
......@@ -343,6 +346,7 @@ int mlx4_ib_multiplex_cm_handler(struct ib_device *ibdev, int port, int slave_id
return -EINVAL;
}
cont:
set_local_comm_id(mad, id->pv_cm_id);
if (mad->mad_hdr.attr_id == CM_DREQ_ATTR_ID)
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment