    xprtrdma: Spread reply processing over more CPUs · ccede759
    Chuck Lever authored
    Commit d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler
    directly from RECV completion") introduced a performance regression
    for NFS I/O small enough not to need memory registration. In multi-
    threaded benchmarks that generate primarily small I/O requests,
    IOPS throughput is reduced by nearly a third. This patch restores
    the previous level of throughput.
    
    Because workqueues are typically BOUND (in particular ib_comp_wq,
    nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend
    to aggregate on the CPU that is handling Receive completions.
    
    The usual approach to addressing this problem is to create a QP
    and CQ for each CPU, and then schedule transactions on the QP
    for the CPU where you want the transaction to complete. The
    transaction then does not require an extra context switch during
    completion to end up on the same CPU where the transaction was
    started.
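
    A rough sketch of that per-CPU completion scheme using the RDMA core
    verbs API (illustrative only: the per_cpu_cq array and the mapping of
    completion vector to CPU are assumptions, and a QP per CPU would be
    created on top of these CQs):

        #include <linux/cpumask.h>
        #include <linux/err.h>
        #include <rdma/ib_verbs.h>

        static struct ib_cq *per_cpu_cq[NR_CPUS];

        static int alloc_per_cpu_cqs(struct ib_device *device, int depth)
        {
            int cpu;

            for_each_online_cpu(cpu) {
                /* Steer each CQ to a distinct completion vector in the
                 * hope that its completions are handled on this CPU --
                 * exactly the mapping the core API does not expose. */
                per_cpu_cq[cpu] = ib_alloc_cq(device, NULL, depth,
                                              cpu % device->num_comp_vectors,
                                              IB_POLL_WORKQUEUE);
                if (IS_ERR(per_cpu_cq[cpu]))
                    return PTR_ERR(per_cpu_cq[cpu]);
            }
            return 0;
        }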
    
    This approach doesn't work for the Linux NFS/RDMA client because
    currently the Linux NFS client does not support multiple connections
    per client-server pair, and the RDMA core API does not make it
    straightforward for ULPs to determine which CPU is responsible for
    handling Receive completions for a CQ.
    
    So for the moment, record the CPU number in the rpcrdma_req before
    the transport sends each RPC Call. Then during Receive completion,
    queue the RPC completion on that same CPU.
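
    A minimal sketch of that mechanism (identifiers such as rl_cpu,
    rr_work, and rpcrdma_receive_wq are illustrative assumptions, not
    necessarily the exact names used in the patch):

        #include <linux/smp.h>
        #include <linux/workqueue.h>

        /* On the sending side, before the RPC Call is posted. The raw
         * variant is fine here: the value is only a placement hint, so
         * it may be sampled with preemption enabled. */
        static void rpcrdma_note_send_cpu(struct rpcrdma_req *req)
        {
            req->rl_cpu = raw_smp_processor_id();
        }

        /* Once a Receive completion has been matched to its rpcrdma_req,
         * finish the reply on the CPU that sent the Call rather than on
         * the CPU that is handling the completion. */
        static void rpcrdma_queue_completion(struct rpcrdma_req *req,
                                             struct rpcrdma_rep *rep)
        {
            queue_work_on(req->rl_cpu, rpcrdma_receive_wq, &rep->rr_work);
        }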
    
    Additionally, move all RPC completion processing to the deferred
    handler so that even RPCs with simple small replies complete on
    the CPU that sent the corresponding RPC Call.
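
    The deferred side might look like the following; the names
    rpcrdma_deferred_completion and rpcrdma_complete_rqst are assumptions
    here, and the point is only that all reply processing runs from the
    work item, on the queued-to CPU:

        /* Wired up when each rpcrdma_rep is created, e.g.
         *     INIT_WORK(&rep->rr_work, rpcrdma_deferred_completion);
         */
        static void rpcrdma_deferred_completion(struct work_struct *work)
        {
            struct rpcrdma_rep *rep =
                container_of(work, struct rpcrdma_rep, rr_work);

            /* Parse the reply and wake the waiting RPC task; even small
             * inline replies that need no memory registration take this
             * path instead of completing in the Receive handler. */
            rpcrdma_complete_rqst(rep);
        }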
    
    Fixes: d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler ...")
    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>