1. 16 Jun, 2017 11 commits
    • Merge branch 'mlx4-XDP-performance-improvements' · 117b07e6
      David S. Miller authored
      Tariq Toukan says:
      
      ====================
      mlx4 XDP performance improvements
      
      This patchset contains data-path improvements, mainly for XDP_DROP
      and XDP_TX cases.
      
      Main patches:
      * Patch 2 by Saeed allows enabling optimized A0 RX steering (in HW) when
        setting a single RX ring.
        With this configuration, HW packet-rate dramatically improves,
        reaching 28.1 Mpps in XDP_DROP case for both IPv4 (37% gain)
        and IPv6 (53% gain).
      * Patch 6 enhances the XDP xmit function. Among other changes, we now
        ring one doorbell per NAPI. The patch gives a 17% gain in the XDP_TX case.
      * Patch 7 obsoletes the NAPI of the XDP_TX completion queue and integrates
        its poll into the respective RX NAPI. The patch gives a 15% gain in the
        XDP_TX case.
      
      Series generated against net-next commit:
      f7aec129 rxrpc: Cache the congestion window setting
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Refactor mlx4_en_free_tx_desc · 4c07c132
      Tariq Toukan authored
      Some code re-ordering, functionally equivalent.
      
      - The !tx_info->inl check is evaluated in both flows anyway
        (common case and end case). Run it first, since this might
        finish the flows earlier.
      - The dma_unmap calls are identical in both flows, so move them
        out of the if block into the common area.
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      Gain is too small to measure; no degradation observed.
      Results are similar for IPv4 and IPv6.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations · 9573e0d3
      Tariq Toukan authored
      Define LOG_TXBB_SIZE, the base-2 logarithm of TXBB_SIZE, and shift
      by it instead of multiplying by TXBB_SIZE.
      The two operations are equivalent because TXBB_SIZE is a power of two.
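
A minimal sketch of that equivalence, assuming TXBB_SIZE is 64 bytes (so LOG_TXBB_SIZE is 6). The two helper functions are hypothetical names used only for illustration, not driver code:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration: a 64-byte TX basic block. */
#define LOG_TXBB_SIZE 6
#define TXBB_SIZE     (1U << LOG_TXBB_SIZE)

/* The shift form is only valid when the size is a power of two. */
static_assert((TXBB_SIZE & (TXBB_SIZE - 1)) == 0,
	      "TXBB_SIZE must be a power of two");

/* Hypothetical helper: byte offset of a TXBB index, multiplication form. */
static inline uint32_t txbb_offset_mul(uint32_t index)
{
	return index * TXBB_SIZE;
}

/* Hypothetical helper: the same byte offset, shift form. */
static inline uint32_t txbb_offset_shift(uint32_t index)
{
	return index << LOG_TXBB_SIZE;
}
```

An optimizing compiler will usually perform this strength reduction itself when the multiplier is a compile-time constant, so the explicit shift mostly documents the intent.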
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      Gain is too small to measure; no degradation observed.
      Results are similar for IPv4 and IPv6.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Increase default TX ring size · 77788b5b
      Tariq Toukan authored
      Increase the default TX ring size (from 512 to 1024) to match
      the RX ring size.
      This gives the XDP TX ring a better chance to keep up with the
      rate of its RX ring in case of a high load of XDP_TX actions.
      
      Tested:
      The ethtool counter rx_xdp_tx_full used to increase; after applying
      this patch it no longer does.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Poll XDP TX completion queue in RX NAPI · 6c78511b
      Tariq Toukan authored
      Instead of having their own NAPIs, XDP TX completion queues get
      polled within the corresponding RX NAPI.
      This prevents any possible race on TX ring prod/cons indices,
      between the context that issues the transmits (RX NAPI) and the
      context that handles the completions (was previously done in
      a separate NAPI).
      
      This also improves performance, as it decreases the number
      of NAPIs running on a CPU, saving the overhead of syncing
      and switching between the contexts.
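
The single-context idea can be sketched roughly as follows. This is a simplified userspace model, not the driver's code: the struct, the function names, and the choice to reap completions before handling RX packets are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified ring state; not the driver's structs. */
struct xdp_tx_ring {
	uint32_t prod;      /* descriptors posted so far */
	uint32_t cons;      /* descriptors reclaimed so far */
	uint32_t size;      /* ring capacity */
	uint32_t completed; /* completions reported but not yet reaped */
};

/* Reap completions: the only place that advances cons. */
static void poll_xdp_tx_cq(struct xdp_tx_ring *tx)
{
	tx->cons += tx->completed;
	tx->completed = 0;
	assert(tx->prod - tx->cons <= tx->size);
}

/* Post one XDP_TX frame: the only place that advances prod.
 * Returns 1 on success, 0 if the ring is full.
 */
static int xdp_xmit(struct xdp_tx_ring *tx)
{
	if (tx->prod - tx->cons >= tx->size)
		return 0;
	tx->prod++;
	return 1;
}

/* One RX poll: completions and transmits run back to back in the same
 * context, so prod and cons are never touched concurrently.
 * Returns the number of frames forwarded.
 */
static int rx_napi_poll(struct xdp_tx_ring *tx, int rx_packets, int budget)
{
	int done, sent = 0;

	poll_xdp_tx_cq(tx);
	for (done = 0; done < rx_packets && done < budget; done++)
		sent += xdp_xmit(tx);
	return sent;
}
```

Because both index updates happen inside one poll routine, the model needs no locking or memory barriers between producer and consumer, which is the property the commit message relies on.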
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Single queue no-RSS optimization ON.
      
      XDP_TX packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 12.0 Mpps | 13.8 Mpps |  15% |
      IPv6 | 12.0 Mpps | 13.8 Mpps |  15% |
      -------------------------------------
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Improve XDP xmit function · 36ea7964
      Tariq Toukan authored
      Several performance improvements in XDP TX datapath,
      including:
      - Ring a single doorbell for the XDP TX ring per NAPI budget,
        instead of once per a lower threshold (previously 8).
        This includes removing the flow of immediate doorbell ringing
        in case of a full TX ring.
      - Compiler branch predictor hints.
      - Calculate values at compile time rather than at run time.
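
To give a feel for the doorbell change, here is a rough counting model. It assumes that the old scheme rang once per 8 queued frames plus once for any remainder at the end of the poll; that remainder behaviour and both function names are assumptions, not taken from the driver.

```c
#include <assert.h>

#define OLD_DOORBELL_THRESHOLD 8 /* per the commit message: "was 8" */

/* Hypothetical model: doorbells rung when ringing once per `threshold`
 * frames, with one extra ring for a partial batch at the end of the poll.
 */
static int doorbells_per_threshold(int frames, int threshold)
{
	assert(frames >= 0 && threshold > 0);
	return (frames + threshold - 1) / threshold;
}

/* Hypothetical model: a single doorbell per poll, if anything was queued. */
static int doorbells_per_poll(int frames)
{
	return frames > 0 ? 1 : 0;
}
```

For a poll that forwards 64 frames (64 being the usual NAPI weight), the model gives 8 doorbells under the old threshold and 1 under the new scheme.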
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Single queue no-RSS optimization ON.
      
      XDP_TX packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 10.3 Mpps | 12.0 Mpps |  17% |
      IPv6 | 10.3 Mpps | 12.0 Mpps |  17% |
      -------------------------------------
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Improve stack xmit function · f28186d6
      Tariq Toukan authored
      Several small code and performance improvements in stack TX datapath,
      including:
      - Compiler branch predictor hints.
      - Minimize variable scope.
      - Move tx_info non-inline flow handling to a separate function.
      - Calculate data_offset at compile time rather than at run time
        (for the !lso_header_size branch).
      - Avoid the ternary operator ("?") when the value can be preset in
        a matching branch.
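
Two of the items above, branch hints and presetting a value instead of using the ternary operator, can be illustrated with a small userspace sketch. The macros mirror the kernel's likely()/unlikely() using the GCC/Clang builtin; the functions, flag values, and segment counts are hypothetical and do not come from the driver.

```c
#include <assert.h>

/* Userspace stand-ins for the kernel's likely()/unlikely() macros. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

enum { FLAG_FAST = 0x1, FLAG_SLOW = 0x2 }; /* hypothetical flag values */

/* Ternary form: the condition is tested for the branch, then again
 * for the return value.
 */
static int pick_flags_ternary(int slow_path, int *segments)
{
	if (slow_path)
		*segments = 4;
	else
		*segments = 1;

	return slow_path ? FLAG_SLOW : FLAG_FAST;
}

/* Preset form: the value is assigned inside the branch that is taken
 * anyway, and the rare branch carries a predictor hint.
 */
static int pick_flags_preset(int slow_path, int *segments)
{
	int flags;

	assert(segments);
	if (unlikely(slow_path)) {
		*segments = 4;
		flags = FLAG_SLOW;
	} else {
		*segments = 1;
		flags = FLAG_FAST;
	}
	return flags;
}
```

Both forms return the same results; the second simply avoids re-evaluating a condition whose outcome is already known.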
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      Gain is too small to measure; no degradation observed.
      Results are similar for IPv4 and IPv6.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Improve transmit CQ polling · cc26a490
      Tariq Toukan authored
      Several small performance improvements in TX CQ polling,
      including:
      - Compiler branch predictor hints.
      - Minimize variable scope.
      - A more accurate check of the CQ type.
      - Use a boolean instead of an int for a binary indication.
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      Packet-rate tests for both regular stack and XDP use cases:
      No noticeable gain, no degradation.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Improve receive data-path · 9bcee89a
      Tariq Toukan authored
      Several small performance improvements in RX datapath,
      including:
      - Compiler branch predictor hints.
      - Replace a multiplication with a shift operation.
      - Minimize variable scope.
      - Write-prefetch for the packet header.
      - Avoid the ternary operator ("?") when the value can be preset in
        a matching branch.
      - Save a branch by updating the RX ring doorbell within
        mlx4_en_refill_rx_buffers(), which now returns void.
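
The write-prefetch item can be sketched in userspace with the GCC/Clang builtin that the kernel's prefetchw() helper is commonly built on. The frame handling below (swapping the two 6-byte MAC addresses) is a hypothetical stand-in for whatever rewrites the header; none of these function names come from the driver.

```c
#include <assert.h>
#include <string.h>

/* Userspace stand-in for a write-prefetch: hint that the cache line at
 * addr will be written soon (second argument 1 means "for write").
 */
static inline void prefetch_for_write(const void *addr)
{
	__builtin_prefetch(addr, 1);
}

/* Hypothetical header rewrite: swap destination and source MACs. */
static void swap_macs(unsigned char *frame)
{
	unsigned char tmp[6];

	memcpy(tmp, frame, 6);
	memcpy(frame, frame + 6, 6);
	memcpy(frame + 6, tmp, 6);
}

/* Issue the prefetch early so the cache line can arrive while other
 * per-packet work runs, then modify the header.
 */
static void handle_frame(unsigned char *frame, size_t len)
{
	assert(len >= 12);
	prefetch_for_write(frame);
	/* ... other per-packet bookkeeping would overlap with the prefetch ... */
	swap_macs(frame);
}
```

The prefetch is only a hint: it never changes program results, and whether it helps depends on how much independent work sits between the hint and the first write.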
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Single queue no-RSS optimization ON
      (enable by ethtool -L <interface> rx 1).
      
      XDP_DROP packet rate:
      Same (28.1 Mpps), lower CPU utilization (from ~100% to ~92%).
      
      Drop packets in TC:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 4.14 Mpps | 4.18 Mpps |   1% |
      -------------------------------------
      
      XDP_TX packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 10.1 Mpps | 10.3 Mpps |   2% |
      IPv6 | 10.1 Mpps | 10.3 Mpps |   2% |
      -------------------------------------
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Optimized single ring steering · 4931c6ef
      Saeed Mahameed authored
      Avoid touching RX QP RSS context when loading with only
      one RX ring, to allow optimized A0 RX steering.
      
      Enable by:
      - loading mlx4_core with module param: log_num_mgm_entry_size = -6.
      - then: ethtool -L <interface> rx 1
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      XDP_DROP packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 20.5 Mpps | 28.1 Mpps |  37% |
      IPv6 | 18.4 Mpps | 28.1 Mpps |  53% |
      -------------------------------------
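
The Gain column is consistent with (after - before) / before, rounded to the nearest percent. A small sketch of that arithmetic, with rates passed in units of 0.1 Mpps so that everything stays in integers; the function name is hypothetical:

```c
#include <assert.h>

/* Percentage gain rounded to the nearest integer.
 * Inputs are packet rates in units of 0.1 Mpps (e.g. 205 for 20.5 Mpps).
 */
static int gain_percent(int before, int after)
{
	assert(before > 0 && after >= before);
	return ((after - before) * 100 + before / 2) / before;
}
```

With the figures in the table, gain_percent(205, 281) evaluates to 37 and gain_percent(184, 281) evaluates to 53, matching the reported IPv4 and IPv6 gains.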
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Remove unused argument in TX datapath function · cf97050d
      Tariq Toukan authored
      Remove owner argument, as it is obsolete and unused.
      This also saves the overhead of calculating its value in data-path.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 15 Jun, 2017 29 commits