    xprtrdma: Fix recursion into rpcrdma_xprt_disconnect() · 4cf44be6
    Chuck Lever authored
    Both Dan and I have observed two processes invoking
    rpcrdma_xprt_disconnect() concurrently. In my case:
    
    1. The connect worker invokes rpcrdma_xprt_disconnect(), which
       drains the QP and waits for the final completion
    2. This causes the newly posted Receive to flush and invoke
       xprt_force_disconnect()
    3. xprt_force_disconnect() sets CLOSE_WAIT and wakes up the RPC task
       that is holding the transport lock
    4. The RPC task invokes xprt_connect(), which calls ->ops->close
    5. xprt_rdma_close() invokes rpcrdma_xprt_disconnect(), which tries
       to destroy the QP.
    
    Deadlock.
    
    To prevent xprt_force_disconnect() from waking anything, handle the
    cleanup after a failed connection attempt in the xprt's sndtask.
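
    The re-entry described above can be modelled in user space. The
    sketch below is only an illustration: every toy_* name is
    hypothetical and none of it is kernel code. It counts how deeply
    the disconnect path nests when a flushed Receive wakes another
    caller, compared with only recording the failure and leaving the
    cleanup to a single owner.

```c
#include <stdbool.h>

/* Toy user-space model. All names are hypothetical, not kernel API. */
struct toy_xprt {
    bool recv_posted;  /* a Receive is posted and will flush on drain */
    bool close_wait;   /* a flushed Receive reported a failure */
    int  depth;        /* current nesting of toy_disconnect() */
    int  max_depth;    /* deepest nesting observed */
};

static void toy_disconnect(struct toy_xprt *x, bool wake_on_flush);

/* Models the completion of a flushed Receive during the drain. */
static void toy_flush_receive(struct toy_xprt *x, bool wake_on_flush)
{
    x->recv_posted = false;
    x->close_wait = true;
    /* Old behaviour in this model: the woken task calls close at once,
     * which re-enters the disconnect path while it is still running. */
    if (wake_on_flush)
        toy_disconnect(x, wake_on_flush);
}

static void toy_disconnect(struct toy_xprt *x, bool wake_on_flush)
{
    x->depth++;
    if (x->depth > x->max_depth)
        x->max_depth = x->depth;
    if (x->recv_posted)
        toy_flush_receive(x, wake_on_flush); /* drain flushes the Receive */
    x->depth--;
}

/* Assumed owner-side cleanup: the single task that owns the transport
 * notices the flag later and tidies up, outside toy_disconnect(). */
static bool toy_owner_cleanup(struct toy_xprt *x)
{
    if (!x->close_wait)
        return false;
    x->close_wait = false;
    return true;
}

/* Returns the deepest nesting of toy_disconnect() for one teardown. */
static int toy_run(bool wake_on_flush)
{
    struct toy_xprt x = { .recv_posted = true };

    toy_disconnect(&x, wake_on_flush);
    toy_owner_cleanup(&x);
    return x.max_depth;
}
```

    In this model toy_run(true) nests the disconnect path twice,
    while toy_run(false) enters it only once. In the real transport
    the nested call blocks instead of returning, which is the
    deadlock.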
    
    The retry loop is removed from rpcrdma_xprt_connect() to ensure
    that the newly allocated ep and id are properly released before
    a REJECTED connection attempt can be retried.
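
    One way to picture the reasoning is a connect function that makes
    a single attempt and releases what it allocated before reporting
    a rejection, so that any retry starts from a clean state. This is
    a minimal user-space sketch under that assumption. The toy_* names
    and the caller-side retry loop are hypothetical and do not
    reproduce the actual patch.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical status codes for the toy model. */
enum toy_status { TOY_OK, TOY_REJECTED, TOY_NOMEM };

struct toy_ep { int unused; };

static int toy_live_eps;  /* endpoints currently allocated */

/* One attempt, no internal retry. On rejection the freshly allocated
 * endpoint is released before the error is returned. */
static enum toy_status toy_connect_once(bool peer_accepts, struct toy_ep **out)
{
    struct toy_ep *ep = malloc(sizeof(*ep));

    *out = NULL;
    if (!ep)
        return TOY_NOMEM;
    toy_live_eps++;

    if (!peer_accepts) {
        free(ep);
        toy_live_eps--;
        return TOY_REJECTED;
    }

    *out = ep;
    return TOY_OK;
}

/* Assumed caller-side retry: each iteration begins with nothing left
 * over from the previous rejected attempt. */
static struct toy_ep *toy_connect_with_retry(const bool *accepts, int attempts)
{
    for (int i = 0; i < attempts; i++) {
        struct toy_ep *ep;

        if (toy_connect_once(accepts[i], &ep) == TOY_OK)
            return ep;
    }
    return NULL;
}
```

    With two rejections followed by an acceptance, only the endpoint
    from the successful attempt remains allocated afterwards.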

    Reported-by: Dan Aloni <dan@kernelim.com>
    Fixes: e28ce900 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")
    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
transport.c 21.9 KB