    net/mlx5e: RX, Enhance legacy Receive Queue memory scheme · 069d1146
    Tariq Toukan authored
    Enhance the memory scheme of the legacy RQ, such that
    only order-0 pages are used.
    
    Whenever possible, prefer using a linear SKB, and build it
    wrapping the WQE buffer.
    
    Otherwise (for example, jumbo frames on x86), use a non-linear SKB,
    with as many frags as needed. In this case, multiple WQE
    scatter entries are used, up to a maximum of 4 frags and 10KB of MTU.
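
As a rough illustration of the frag limits above, here is a minimal sketch in C. It assumes each frag covers at most one order-0 page and ignores headroom and alignment; the driver's real frag sizing may differ. The names `sketch_num_frags`, `SKETCH_MAX_FRAGS` and `SKETCH_MAX_BYTES` are hypothetical and are not taken from the driver.

```c
#include <stddef.h>

#define SKETCH_MAX_FRAGS 4u                   /* from the text: at most 4 frags */
#define SKETCH_MAX_BYTES ((size_t)10 * 1024)  /* from the text: at most 10KB */

/*
 * Hypothetical helper: number of page-sized frags needed for a frame of
 * frame_bytes bytes. Assumption: one frag per order-0 page, no headroom.
 * Returns 0 when the frame exceeds either limit or the input is invalid.
 */
static unsigned int sketch_num_frags(size_t frame_bytes, size_t page_size)
{
    size_t n;

    if (frame_bytes == 0 || page_size == 0 || frame_bytes > SKETCH_MAX_BYTES)
        return 0;

    n = (frame_bytes + page_size - 1) / page_size;  /* round up */
    return n <= SKETCH_MAX_FRAGS ? (unsigned int)n : 0;
}
```

Under these assumptions a default-MTU frame fits in one frag, which corresponds to the linear SKB case, while a 9000-byte frame on 4KB pages needs three.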
    
    This required removing HW LRO support in legacy RQ, as it would
    require a large number of page allocations and scatter entries per WQE
    on archs with PAGE_SIZE = 4KB, yielding bad performance.
    
    In earlier patches, we guaranteed that all completions are in-order,
    and that we use a cyclic WQ.
    This creates an opportunity for a performance optimization:
    The mapping between a "struct mlx5e_dma_info", and the
    WQEs (struct mlx5e_wqe_frag_info) pointing to it, is constant
    across different cycles of a WQ. This allows initializing
    the mapping at RQ creation time, instead of handling it
    in the datapath.
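
The idea of a mapping fixed at creation time can be sketched as follows. This is a simplified model, not the driver's code: the `sketch_*` types are stand-ins for struct mlx5e_dma_info and struct mlx5e_wqe_frag_info, and the sketch assumes fixed-stride frags packed back to back into pages.

```c
#include <stddef.h>

/* Stand-in for struct mlx5e_dma_info (hypothetical, reduced to one field). */
struct sketch_dma_info {
    void *page;
};

/* Stand-in for struct mlx5e_wqe_frag_info (hypothetical). */
struct sketch_frag_info {
    struct sketch_dma_info *di;  /* page slot this frag points into */
    size_t offset;               /* byte offset of the frag within the page */
};

/*
 * Build the frag-to-page mapping once, at queue creation.
 * Assumption: frags have a fixed stride and are packed into pages in order.
 * Returns the number of dma_info slots used, or 0 on invalid input or when
 * n_di slots are not enough.
 */
static size_t sketch_init_frags(struct sketch_frag_info *frags, size_t n_frags,
                                struct sketch_dma_info *di, size_t n_di,
                                size_t stride, size_t page_size)
{
    size_t per_page, i, page;

    if (stride == 0 || stride > page_size || n_frags == 0)
        return 0;

    per_page = page_size / stride;
    for (i = 0; i < n_frags; i++) {
        page = i / per_page;
        if (page >= n_di)
            return 0;
        frags[i].di = &di[page];
        frags[i].offset = (i % per_page) * stride;
    }
    return (n_frags - 1) / per_page + 1;
}
```

Because the WQ is cyclic and completions are in-order, the same frag index meets the same page slot on every cycle, so in this model the datapath only has to refill the pages behind the slots.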
    
    A struct mlx5e_dma_info that is shared between different WQEs
    is allocated by the first WQE, and freed by the last one.
    This implies an important requirement: WQEs that share the same
    struct mlx5e_dma_info must be posted within the same NAPI.
    Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly
    point to the new struct mlx5e_dma_info, not the one that was posted
    (and the HW wrote to).
    This bulking requirement also benefits performance,
    hence we extend the bulk beyond the minimal requirement above.
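
The allocate-by-first, free-by-last rule and the bulking can be modeled with a small sketch. All names here are hypothetical; a counter stands in for real page allocation, and treating offset 0 as "first user of the page" is an assumption of this model rather than something the commit text states.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct mlx5e_dma_info: only tracks whether a page is held. */
struct sketch_page_slot {
    bool allocated;
};

/* Stand-in for struct mlx5e_wqe_frag_info (hypothetical fields). */
struct sketch_frag {
    struct sketch_page_slot *slot;
    size_t offset;      /* assumption: offset 0 marks the first user */
    bool last_in_page;  /* marks the last user of the shared page */
};

/* Posting: only the first frag in a page "allocates" it. */
static void sketch_post(struct sketch_frag *f, int *live_pages)
{
    if (f->offset == 0) {
        f->slot->allocated = true;
        (*live_pages)++;
    }
}

/* Completion: only the last frag in a page "frees" it. */
static void sketch_complete(struct sketch_frag *f, int *live_pages)
{
    if (f->last_in_page) {
        f->slot->allocated = false;
        (*live_pages)--;
    }
}

/*
 * Round the number of missing WQEs down to whole bulks, so that WQEs
 * sharing a page slot are always posted together. Returns 0 if bulk is 0.
 */
static size_t sketch_round_to_bulk(size_t missing, size_t bulk)
{
    return bulk ? missing - missing % bulk : 0;
}
```

In this model, posting only whole bulks means a page slot is never refilled while an earlier WQE that shares it is still outstanding.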
    
    With this memory scheme, the RQ's memory footprint is reduced by a
    factor of 2 on x86, and by a factor of 32 on PowerPC.
    The same factors apply to the number of pages in a GRO session.
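
One reading that is consistent with these factors is sketched below. It rests on assumptions the commit text does not state: that each default-MTU receive buffer occupies a 2KB stride, that the previous scheme held one full page per WQE, and that the PowerPC system uses 64KB pages. The helper name is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper: how many fixed-stride buffers share one page.
 * Under the assumptions above, this equals the footprint reduction factor.
 * Returns 0 for a zero stride or when no buffer fits in a page.
 */
static size_t sketch_footprint_factor(size_t page_size, size_t stride)
{
    return stride ? page_size / stride : 0;
}
```

With a 2KB stride, a 4KB page gives a factor of 2 and a 64KB page gives a factor of 32, matching the numbers quoted above.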
    
    Performance tests:
    ConnectX-4, single core, single RX ring, default MTU.
    
    x86:
    CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
    
    Packet rate (early drop in TC): no degradation
    TCP streams: ~5% improvement
    
    PowerPC:
    CPU: POWER8 (raw), altivec supported
    
    Packet rate (early drop in TC): 20% gain
    TCP streams: 25% gain
    Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>