    IB/mlx4: Fix CM REQ retries in paravirt mode · 4542e3c7
    Håkon Bugge authored
    In paravirt mode, a retried CM REQ cannot succeed, because a new
    pv_cm_id is created for each REQ without checking whether one
    already exists for that connection.
    
    Fix this by looking up an existing id before creating a new one.
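    
    The lookup-before-create idea can be sketched in a small user-space
    model. This is not the driver code: the table layout, MAX_IDS, and
    the function names (id_table_init, id_lookup, id_alloc,
    map_req_always_alloc, map_req_lookup_first) are hypothetical and
    chosen only for illustration. The model assumes that a REQ is
    identified by the pair (slave_id, sl_cm_id) and that each new
    mapping receives a fresh pv_cm_id.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names come from the driver. */
#define MAX_IDS 16

struct id_entry {
    int in_use;
    int slave_id;        /* which guest sent the REQ */
    uint32_t sl_cm_id;   /* cm_id chosen by the guest */
    uint32_t pv_cm_id;   /* cm_id placed on the wire */
};

struct id_table {
    struct id_entry entries[MAX_IDS];
    uint32_t next_pv_cm_id;
};

void id_table_init(struct id_table *t)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_IDS; i++)
        t->entries[i].in_use = 0;
    t->next_pv_cm_id = 1;
}

/* Return the existing mapping for (slave_id, sl_cm_id), or NULL if none. */
struct id_entry *id_lookup(struct id_table *t, int slave_id, uint32_t sl_cm_id)
{
    for (size_t i = 0; i < MAX_IDS; i++) {
        struct id_entry *e = &t->entries[i];
        if (e->in_use && e->slave_id == slave_id && e->sl_cm_id == sl_cm_id)
            return e;
    }
    return NULL;
}

/* Create a new mapping with a fresh pv_cm_id; NULL if the table is full. */
struct id_entry *id_alloc(struct id_table *t, int slave_id, uint32_t sl_cm_id)
{
    for (size_t i = 0; i < MAX_IDS; i++) {
        struct id_entry *e = &t->entries[i];
        if (!e->in_use) {
            e->in_use = 1;
            e->slave_id = slave_id;
            e->sl_cm_id = sl_cm_id;
            e->pv_cm_id = t->next_pv_cm_id++;
            return e;
        }
    }
    return NULL;
}

/* Models the bug: every REQ, including a retry, gets a new pv_cm_id. */
int map_req_always_alloc(struct id_table *t, int slave_id,
                         uint32_t sl_cm_id, uint32_t *pv_cm_id)
{
    struct id_entry *e = id_alloc(t, slave_id, sl_cm_id);
    if (!e)
        return -1;
    *pv_cm_id = e->pv_cm_id;
    return 0;
}

/* Models the fix: reuse an existing mapping before allocating a new one. */
int map_req_lookup_first(struct id_table *t, int slave_id,
                         uint32_t sl_cm_id, uint32_t *pv_cm_id)
{
    struct id_entry *e = id_lookup(t, slave_id, sl_cm_id);
    if (!e)
        e = id_alloc(t, slave_id, sl_cm_id);
    if (!e)
        return -1;
    *pv_cm_id = e->pv_cm_id;
    return 0;
}
```

    In this model, calling map_req_always_alloc() twice with the same
    (slave_id, sl_cm_id) yields two different pv_cm_id values, which is
    the situation a retried REQ runs into. map_req_lookup_first()
    returns the same pv_cm_id for the retry, so the passive side sees
    one connection attempt.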
    
    This bug can be provoked by running an RDMA CM user-land application
    with a five-second delay inserted before the rdma_accept() call on
    the passive side. This delay is longer than the default CMA timeout
    and triggers a retry from the active side. Without the fix, the
    retried REQ uses a different pv_cm_id (the cm_id on the wire) than
    the original REQ. This confuses the CM protocol, and two REJs are
    sent from the passive side.
    
    Here is an excerpt from ibdump running without the patch:
    
    3.285092       LID: 4 -> LID: 4       SDP 290 CM: ConnectRequest(SDP Hello)
    7.382711       LID: 4 -> LID: 4       SDP 290 CM: ConnectRequest(SDP Hello)
    7.382861       LID: 4 -> LID: 4       InfiniBand 290 CM: ConnectReject
    7.387644       LID: 4 -> LID: 4       InfiniBand 290 CM: ConnectReject
    
    And here is the same excerpt with the bug fix applied:
    
    3.251010       LID: 4 -> LID: 4       SDP 290 CM: ConnectRequest(SDP Hello)
    7.349387       LID: 4 -> LID: 4       SDP 290 CM: ConnectRequest(SDP Hello)
    8.258443       LID: 4 -> LID: 4       SDP 290 CM: ConnectReply(SDP Hello)
    8.259890       LID: 4 -> LID: 4       InfiniBand 290 CM: ReadyToUse
    
    Suggested-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
    Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
    Reported-by: Wei Lin Guay <wei.lin.guay@oracle.com>
    Tested-by: Wei Lin Guay <wei.lin.guay@oracle.com>
    Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
    Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
    Signed-off-by: Doug Ledford <dledford@redhat.com>