1. 22 Sep, 2020 12 commits
    • net/mlx5e: Enhanced TX MPWQE for SKBs · 5af75c74
      Maxim Mikityanskiy authored
      This commit adds support for the Enhanced TX MPWQE feature in the regular
      (SKB) data path. An MPWQE (multi-packet work queue element) can serve
      multiple packets, reducing the PCI bandwidth spent on control traffic.
      
      Two new stats (tx*_mpwqe_blks and tx*_mpwqe_pkts) are added. The feature
      is on by default and controlled by the skb_tx_mpwqe private flag.
      
      In an MPWQE, the eseg is shared among all packets, so eseg-based offloads
      (IPSEC, GENEVE, checksum) run on a separate eseg that is compared to the
      eseg of the current MPWQE session to decide whether the new packet can be
      added to the same session.
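
      To illustrate that decision, here is a minimal standalone sketch (not the
      driver's code): the offload results are written into a scratch eseg and
      compared with the eseg of the open session. The struct layout and field
      names are assumptions made for the example.

        #include <stdbool.h>
        #include <string.h>

        struct eth_seg {
                unsigned char  cs_flags;       /* checksum offload flags */
                unsigned short inline_hdr_sz;  /* bytes of inlined headers */
                unsigned char  inline_hdr[18]; /* inlined header bytes */
        };

        /* The packet may join the session only if its eseg-based offload
         * results match those of the current session. */
        static bool mpwqe_same_eseg(const struct eth_seg *session_eseg,
                                    const struct eth_seg *pkt_eseg)
        {
                return session_eseg->cs_flags == pkt_eseg->cs_flags &&
                       session_eseg->inline_hdr_sz == pkt_eseg->inline_hdr_sz &&
                       !memcmp(session_eseg->inline_hdr, pkt_eseg->inline_hdr,
                               sizeof(session_eseg->inline_hdr));
        }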
      
      MPWQE is not compatible with certain offloads and features, such as TLS
      offload, TSO and nonlinear SKBs. If such incompatible features are in use,
      the driver gracefully falls back to non-MPWQE.
      
      This change has no performance impact in the TCP single stream test and
      the XDP_TX single stream test.
      
      UDP pktgen, 64-byte packets, single stream, MPWQE off:
        Packet rate: 16.96 Mpps (±0.12 Mpps) -> 17.01 Mpps (±0.20 Mpps)
        Instructions per packet: 421 -> 429
        Cycles per packet: 156 -> 161
        Instructions per cycle: 2.70 -> 2.67
      
      UDP pktgen, 64-byte packets, single stream, MPWQE on:
        Packet rate: 16.96 Mpps (±0.12 Mpps) -> 20.94 Mpps (±0.33 Mpps)
        Instructions per packet: 421 -> 329
        Cycles per packet: 156 -> 123
        Instructions per cycle: 2.70 -> 2.67
      
      Enabling MPWQE can reduce PCI bandwidth usage:
        PCI Gen2, pktgen at fixed rate of 36864000 pps on 24 CPU cores:
          Inbound PCI utilization with MPWQE off: 80.3%
          Inbound PCI utilization with MPWQE on: 59.0%
        PCI Gen3, pktgen at fixed rate of 56064000 pps on 24 CPU cores:
          Inbound PCI utilization with MPWQE off: 65.4%
          Inbound PCI utilization with MPWQE on: 49.3%
      
      Enabling MPWQE can also reduce CPU load, increasing the packet rate when
      the CPU is the bottleneck:
        PCI Gen2, pktgen at full rate on 24 CPU cores:
          Packet rate with MPWQE off: 37.5 Mpps
          Packet rate with MPWQE on: 49.0 Mpps
        PCI Gen3, pktgen at full rate on 24 CPU cores:
          Packet rate with MPWQE off: 57.0 Mpps
          Packet rate with MPWQE on: 66.8 Mpps
      
      Burst size in all pktgen tests is 32.
      
      CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
      NIC: Mellanox ConnectX-6 Dx
      GCC 10.2.0
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Move TX code into functions to be used by MPWQE · 67044a88
      Maxim Mikityanskiy authored
      mlx5e_txwqe_complete performs some actions that can be split out into
      separate functions:
      
      1. Update the flags needed for hardware timestamping.
      
      2. Stop the TX queue if it's full.
      
      Move these actions into separate functions so they can be reused by the
      MPWQE code in the following commit, and to keep the responsibilities of
      each function clear.
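
      A standalone sketch of the second step, the "stop the queue when it is
      full" check factored out of the completion path, follows. The types,
      fields and the stop action are simplified stand-ins, not the driver's
      real API.

        #include <stdbool.h>
        #include <stdint.h>

        struct tx_queue {
                uint16_t pc;        /* producer counter (WQEBBs posted) */
                uint16_t cc;        /* consumer counter (WQEBBs completed) */
                uint16_t wq_size;   /* ring size in WQEBBs */
                uint16_t stop_room; /* worst case WQEBBs one xmit may need */
                bool     stopped;
        };

        static inline bool tx_queue_has_room(const struct tx_queue *sq)
        {
                uint16_t in_flight = (uint16_t)(sq->pc - sq->cc);

                return (uint16_t)(sq->wq_size - in_flight) >= sq->stop_room;
        }

        /* Stop the queue if the next xmit call might not fit. */
        static inline void tx_check_stop(struct tx_queue *sq)
        {
                if (!tx_queue_has_room(sq))
                        sq->stopped = true; /* driver: netif_tx_stop_queue() */
        }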
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Rename xmit-related structs to generalize them · b39fe61e
      Maxim Mikityanskiy authored
      As preparation for the upcoming TX MPWQE support for SKBs, rename struct
      mlx5e_xdp_mpwqe to mlx5e_tx_mpwqe and move it above struct mlx5e_txqsq.
      This structure will be reused in the regular SQ and in the regular TX
      data path. Also rename mlx5e_xdp_xmit_data to mlx5e_xmit_data - it will
      be used in the upcoming TX MPWQE flow.
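
      As a rough illustration of what such a generalized xmit-data descriptor
      carries, here is a minimal standalone sketch; the exact field set and
      names are assumptions for the example, not the real definition.

        #include <stdint.h>

        struct xmit_data {
                uint64_t dma_addr; /* DMA-mapped address of the payload */
                void    *data;     /* CPU pointer to the payload */
                uint32_t len;      /* payload length in bytes */
        };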
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Generalize TX MPWQE checks for full session · 530d5ce2
      Maxim Mikityanskiy authored
      As preparation for the upcoming TX MPWQE for SKBs, create a function
      (mlx5e_tx_mpwqe_is_full) to check whether an MPWQE session is full. This
      function will be shared by MPWQE code for XDP and for SKBs. Defines are
      renamed and moved to make them not XDP-specific.
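
      A minimal standalone sketch of such a "session is full" check follows;
      the struct layout and the data-segment budget below are illustrative
      assumptions, not the driver's exact definitions.

        #include <stdbool.h>
        #include <stdint.h>

        #define TX_MPWQE_MAX_NUM_DS 63u /* assumed per-MPWQE DS budget */

        struct tx_mpwqe_session {
                uint32_t ds_count;  /* data segments consumed so far */
                uint32_t pkt_count; /* packets packed into this MPWQE */
        };

        static bool tx_mpwqe_is_full(const struct tx_mpwqe_session *s)
        {
                return s->ds_count == TX_MPWQE_MAX_NUM_DS;
        }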
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Support multiple SKBs in a TX WQE · 338c46c6
      Maxim Mikityanskiy authored
      TX MPWQE support for SKBs is coming in one of the following patches, and
      a single MPWQE can send multiple SKBs. This commit prepares the TX path
      code to handle such cases:
      
      1. An additional FIFO for SKBs is added, just like the existing FIFO for
      DMA chunks (a minimal sketch of such a FIFO follows this list).
      
      2. struct mlx5e_tx_wqe_info will contain num_fifo_pkts. If a given WQE
      contains only one packet, num_fifo_pkts will be zero, and the SKB will
      be stored in mlx5e_tx_wqe_info, as usual. If num_fifo_pkts > 0, the SKB
      pointer will be NULL, and the SKBs will be stored in the FIFO.
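
      As referenced above, here is a standalone sketch of a power-of-two
      pointer FIFO in the spirit of the SKB FIFO described in point 1; the
      names, fields and sizing scheme are illustrative assumptions.

        #include <stdint.h>

        struct skb; /* opaque stand-in for struct sk_buff */

        struct skb_fifo {
                struct skb **fifo; /* backing array, power-of-two length */
                uint16_t     pc;   /* producer counter */
                uint16_t     cc;   /* consumer counter */
                uint16_t     mask; /* array length - 1 */
        };

        static void skb_fifo_push(struct skb_fifo *f, struct skb *skb)
        {
                f->fifo[f->pc++ & f->mask] = skb;
        }

        static struct skb *skb_fifo_pop(struct skb_fifo *f)
        {
                return f->fifo[f->cc++ & f->mask];
        }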
      
      This change has no performance impact in the TCP single stream test and
      the XDP_TX single stream test.
      
      When compiled with a recent GCC, this change shows no visible
      performance impact on the UDP pktgen (burst 32) single stream test either:
        Packet rate: 16.95 Mpps (±0.15 Mpps) -> 16.96 Mpps (±0.12 Mpps)
        Instructions per packet: 429 -> 421
        Cycles per packet: 160 -> 156
        Instructions per cycle: 2.69 -> 2.70
      
      CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
      NIC: Mellanox ConnectX-6 Dx
      GCC 10.2.0
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Move the TLS resync check out of the function · 56e4da66
      Maxim Mikityanskiy authored
      Before this patch, mlx5e_ktls_tx_handle_resync_dump_comp checked for
      resync_dump_frag_page. This check ran for all WQEs without an SKB,
      including padding WQEs, and required a function call. Normally, padding
      WQEs happen much more often than TLS resyncs. Move this check out of the
      function and into an inline helper to save a function call on every
      padding WQE.
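
      A standalone sketch of this pattern follows: a cheap inline check that
      avoids calling the rare TLS-resync handler for ordinary padding WQEs.
      The names and fields are illustrative, not the driver's identifiers.

        struct page;

        struct tx_wqe_info {
                void        *skb;                   /* NULL for padding WQEs */
                struct page *resync_dump_frag_page; /* set only for TLS resync */
        };

        void handle_resync_dump_comp(struct tx_wqe_info *wi); /* rare, out of line */

        static inline void handle_wqe_without_skb(struct tx_wqe_info *wi)
        {
                /* Padding WQEs (the common case) take no function call. */
                if (wi->resync_dump_frag_page)
                        handle_resync_dump_comp(wi);
        }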
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT · 97e3afd6
      Maxim Mikityanskiy authored
      A constant for the number of DS in an empty WQE (i.e. a WQE without data
      segments) is needed in multiple places (normal TX data path, MPWQE in
      XDP), but currently we have a constant for XDP and an inline formula in
      normal TX. This patch introduces a common constant.
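
      The idea can be sketched as deriving the "empty WQE" DS count from the
      WQE layout instead of hard-coding it; the segment sizes below are
      assumptions chosen for the example, not the driver's exact values.

        #include <stdint.h>

        #define SEND_WQE_DS 16u /* bytes per data-segment unit */

        struct tx_wqe_hdr {     /* an "empty" WQE: ctrl + eth segments only */
                uint8_t cseg[16];
                uint8_t eseg[16];
        };

        #define TX_WQE_EMPTY_DS_COUNT (sizeof(struct tx_wqe_hdr) / SEND_WQE_DS)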
      
      Additionally, mlx5e_xdp_mpwqe_session_start is converted to use struct
      assignment, because the code nearby is touched.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Small improvements for XDP TX MPWQE logic · 388a2b56
      Maxim Mikityanskiy authored
      Use MLX5E_XDP_MPW_MAX_WQEBBS to reserve space for an MPWQE, because it is
      actually the maximum size an MPWQE can take.
      
      Reorganize the logic that checks when to close the MPWQE session:
      
      1. Put all checks into a single function.
      
      2. When inline is on, make only the stricter comparison - if it fails, the
      less strict one would fail as well. The compiler probably optimized the
      redundant check out anyway, but it's clearer to reflect this in the code.
      
      The MLX5E_XDP_INLINE_WQE_* defines are also changed to make the
      calculations logically more correct. Although MLX5E_XDP_INLINE_WQE_MAX_DS_CNT
      used to be 16 and keeps that value, it was calculated as
      DIV_ROUND_UP(max inline packet size, MLX5_SEND_WQE_DS), whereas the
      numerator should also have included sizeof(struct mlx5_wqe_inline_seg).
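
      A worked illustration of this rounding fix follows; the sizes are
      assumptions chosen for the example (picked so that the result stays the
      same before and after, as described above), not the real values.

        #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

        #define SEND_WQE_DS          16u  /* bytes per data-segment unit */
        #define INLINE_SEG_HDR        4u  /* inline segment header size */
        #define MAX_INLINE_PKT_SIZE 250u  /* placeholder max inline packet size */

        /* Before: the inline segment header was not counted in the numerator. */
        #define INLINE_WQE_MAX_DS_OLD DIV_ROUND_UP(MAX_INLINE_PKT_SIZE, SEND_WQE_DS)

        /* After: the header occupies WQE space alongside the packet bytes, so
         * it belongs in the numerator. */
        #define INLINE_WQE_MAX_DS_NEW \
                DIV_ROUND_UP(MAX_INLINE_PKT_SIZE + INLINE_SEG_HDR, SEND_WQE_DS)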
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Refactor xmit functions · 8e4b53f6
      Maxim Mikityanskiy authored
      The huge function mlx5e_sq_xmit is split into several smaller ones to
      achieve multiple goals:
      
      1. Reuse the code in IPoIB.
      
      2. Better integrate with TLS, IPSEC, GENEVE and checksum offloads. Now
      it's possible to reserve space in the WQ before running eseg-based
      offloads, so:
      
      2.1. There is no longer a need to copy cseg and eseg after
      mlx5e_fill_sq_frag_edge.
      
      2.2. mlx5e_txqsq_get_next_pi will be used instead of the legacy
      mlx5e_fill_sq_frag_edge for better code maintainability and reuse.
      
      3. Prepare for the upcoming TX MPWQE for SKBs. It will intervene after
      mlx5e_sq_calc_wqe_attr to check if it's possible to use MPWQE, and the
      code flow will split into two paths: MPWQE and non-MPWQE.
      
      Two high-level functions are provided to send packets (a simplified
      control-flow sketch follows this list):
      
      * mlx5e_xmit is called by the networking stack, runs offloads and sends
      the packet. In one of the following patches, MPWQE support will be added
      to this flow.
      
      * mlx5e_sq_xmit_simple is called by the TLS offload, runs only the
      checksum offload and sends the packet.
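
      As referenced above, here is a simplified, standalone control-flow sketch
      of the split; the types and helper names are stand-ins, not the driver's
      real signatures.

        struct skb;
        struct txq;

        static void run_full_offloads(struct txq *sq, struct skb *skb)
        {
                /* TLS, IPSEC, GENEVE and checksum offloads */
        }

        static void run_csum_offload(struct txq *sq, struct skb *skb)
        {
                /* checksum offload only */
        }

        static void sq_xmit_wqe(struct txq *sq, struct skb *skb)
        {
                /* build and post the WQE */
        }

        /* Called by the networking stack: full offload pipeline, then send. */
        static void xmit(struct txq *sq, struct skb *skb)
        {
                run_full_offloads(sq, skb);
                sq_xmit_wqe(sq, skb);
        }

        /* Called by the TLS offload: checksum offload only, then send. */
        static void sq_xmit_simple(struct txq *sq, struct skb *skb)
        {
                run_csum_offload(sq, skb);
                sq_xmit_wqe(sq, skb);
        }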
      
      This change has no performance impact in the TCP single stream test and
      the XDP_TX single stream test.
      
      When compiled with a recent GCC, this change shows no visible
      performance impact on the UDP pktgen (burst 32) single stream test either:
        Packet rate: 16.86 Mpps (±0.15 Mpps) -> 16.95 Mpps (±0.15 Mpps)
        Instructions per packet: 434 -> 429
        Cycles per packet: 158 -> 160
        Instructions per cycle: 2.75 -> 2.69
      
      CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
      NIC: Mellanox ConnectX-6 Dx
      GCC 10.2.0
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Move mlx5e_tx_wqe_inline_mode to en_tx.c · d02dfcd5
      Maxim Mikityanskiy authored
      Move mlx5e_tx_wqe_inline_mode from en/txrx.h to en_tx.c as it's only
      used there.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Use struct assignment to initialize mlx5e_tx_wqe_info · 8ba6f183
      Maxim Mikityanskiy authored
      Struct assignment guarantees that all fields of the structure are
      initialized (those that are not mentioned are zeroed). It makes the code
      more robust and reduces the chance of unpredictable behavior when one
      forgets to reset some field and it holds a stale value from a previous
      use of the structure.
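
      A tiny standalone illustration of this point: with a struct assignment
      from a compound literal, fields that are not mentioned are implicitly
      zero-initialized. The struct and field names are made up for the example.

        struct wqe_info {
                unsigned int num_bytes;
                unsigned int num_wqebbs;
                void        *skb;
        };

        static void init_wqe_info(struct wqe_info *wi)
        {
                *wi = (struct wqe_info) {
                        .num_bytes  = 64,
                        .num_wqebbs = 1,
                        /* .skb not mentioned -> implicitly NULL */
                };
        }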
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Refactor inline header size calculation in the TX path · 6d55af43
      Maxim Mikityanskiy authored
      As preparation for the next patch, don't increase ihs to calculate ds_cnt
      and then decrease it again; instead, keep the intermediate value in a
      temporary. This code has the same number of arithmetic operations, but it
      now allows the ds_cnt calculation to be split out, which will be done in
      the next patch.
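
      A standalone sketch of the refactored shape described above follows; the
      names and sizes are illustrative assumptions, not the driver's code.

        #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

        #define DS_SIZE     16u /* assumed bytes per data segment */
        #define INL_SEG_HDR  4u /* assumed inline segment header size */

        /* ihs itself stays untouched; only a temporary holds the inline
         * header size padded with the inline segment header. */
        static unsigned int ds_cnt_for_inline(unsigned int ihs)
        {
                unsigned int inl = ihs + INL_SEG_HDR;

                return DIV_ROUND_UP(inl, DS_SIZE);
        }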
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
  2. 21 Sep, 2020 28 commits