net/mlx5e: SHAMPO, Coalesce skb fragments to page size

When doing hardware GRO (SHAMPO), the driver puts each data payload of a
packet from the wire into one skb fragment. TCP Zero-Copy expects
page-sized skb fragments to be able to do its page-flipping magic. With
the current way of arranging fragments by the driver, only specific MTUs
(page-sized multiple + header size) will yield such page-sized fragments
in a high percentage.

This change improves payload arrangement in the skb for hardware GRO by
coalescing payloads into a single skb fragment when possible.

To demonstrate the fix, running tcp_mmap with an MTU of 1500 yields:
- Before:  0 % bytes mmap'ed
- After : 81 % bytes mmap'ed

More importantly, coalescing considerably improves the HW GRO
performance. Here are the results for an iperf3 bandwidth benchmark:

+---------+--------+--------+------------------------+-----------+
| streams | SW GRO | HW GRO | HW GRO with coalescing | Unit      |
|---------+--------+--------+------------------------+-----------|
| 1       | 36     | 42     | 57                     | Gbits/sec |
| 4       | 34     | 39     | 50                     | Gbits/sec |
| 8       | 31     | 35     | 43                     | Gbits/sec |
+---------+--------+--------+------------------------+-----------+

Benchmark details:
VM based setup
CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores
NIC: ConnectX-7 100GbE
iperf3 and irq running on same CPU over a single receive queue

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-15-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
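
The effect of coalescing on fragment sizes can be illustrated with a small user-space model. This is only a sketch and not driver code: every `sketch_*` name, the 4096-byte page size, and the 17-fragment cap are assumptions made for illustration. It also assumes that payloads land back to back in the data pages and that an MTU of 1500 leaves 1448 bytes of TCP payload (20-byte IPv4 header, 20-byte TCP header, 12 bytes of timestamp option). The coalescing test mirrors the idea behind `skb_can_coalesce()`: a new chunk is merged when it starts exactly where the last fragment ends, in the same page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size */
#define SKETCH_MAX_FRAGS 17u   /* illustrative cap on fragments per skb */

/* Hypothetical stand-in for one skb fragment. */
struct sketch_frag {
	uint32_t page_idx; /* which data page the fragment lives in */
	uint32_t offset;   /* start offset inside that page */
	uint32_t len;      /* fragment length in bytes */
};

/* Hypothetical stand-in for the fragment array of an skb. */
struct sketch_skb {
	struct sketch_frag frags[SKETCH_MAX_FRAGS];
	size_t nr_frags;
};

/* True when the new chunk directly continues the last fragment. */
static bool sketch_can_coalesce(const struct sketch_skb *skb,
				uint32_t page_idx, uint32_t offset)
{
	const struct sketch_frag *last;

	if (skb->nr_frags == 0)
		return false;
	last = &skb->frags[skb->nr_frags - 1];
	return last->page_idx == page_idx &&
	       last->offset + last->len == offset;
}

/* Grow the last fragment when possible, otherwise append a new one. */
static bool sketch_add_frag(struct sketch_skb *skb, uint32_t page_idx,
			    uint32_t offset, uint32_t len, bool coalesce)
{
	if (coalesce && sketch_can_coalesce(skb, page_idx, offset)) {
		skb->frags[skb->nr_frags - 1].len += len;
		return true;
	}
	if (skb->nr_frags == SKETCH_MAX_FRAGS)
		return false;
	skb->frags[skb->nr_frags++] =
		(struct sketch_frag){ page_idx, offset, len };
	return true;
}

/*
 * Place npkts payloads of payload bytes back to back, splitting each
 * payload at page boundaries. Returns the resulting fragment count.
 */
static size_t sketch_fill(struct sketch_skb *skb, uint32_t payload,
			  uint32_t npkts, bool coalesce)
{
	uint64_t pos = 0;

	skb->nr_frags = 0;
	for (uint32_t i = 0; i < npkts; i++) {
		uint32_t left = payload;

		while (left) {
			uint32_t off = (uint32_t)(pos % SKETCH_PAGE_SIZE);
			uint32_t room = SKETCH_PAGE_SIZE - off;
			uint32_t chunk = left < room ? left : room;

			if (!sketch_add_frag(skb,
					     (uint32_t)(pos / SKETCH_PAGE_SIZE),
					     off, chunk, coalesce))
				return skb->nr_frags;
			pos += chunk;
			left -= chunk;
		}
	}
	return skb->nr_frags;
}

/* Count fragments that cover exactly one full page. */
static size_t sketch_count_page_sized(const struct sketch_skb *skb)
{
	size_t n = 0;

	for (size_t i = 0; i < skb->nr_frags; i++)
		if (skb->frags[i].offset == 0 &&
		    skb->frags[i].len == SKETCH_PAGE_SIZE)
			n++;
	return n;
}
```

Under these assumptions, calling `sketch_fill(&skb, 1448, 10, false)` produces 13 fragments, none of them page sized, while the same call with `coalesce` set to `true` produces 4 fragments, 3 of which span a full page. That matches the reasoning in the message above: without coalescing, only payload sizes that are a multiple of the page size line up with page boundaries.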

Commit 14ae2fd1 authored Jun 04, 2024 by Dragos Tatulea Committed by Jakub Kicinski Jun 05, 2024

Showing with 13 additions and 6 deletions

drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +13 -6

--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -523,15 +523,23 @@ mlx5e_add_skb_shared_info_frag(struct mlx5e_rq *rq, struct skb_shared_info *sinf

 static inline void
 mlx5e_add_skb_frag(struct mlx5e_rq *rq, struct sk_buff *skb,
-		   struct page *page, u32 frag_offset, u32 len,
+		   struct mlx5e_frag_page *frag_page,
+		   u32 frag_offset, u32 len,
 		   unsigned int truesize)
 {
-	dma_addr_t addr = page_pool_get_dma_addr(page);
+	dma_addr_t addr = page_pool_get_dma_addr(frag_page->page);
+	u8 next_frag = skb_shinfo(skb)->nr_frags;

 	dma_sync_single_for_cpu(rq->pdev, addr + frag_offset, len,
 				rq->buff.map_dir);
-	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
-			page, frag_offset, len, truesize);
+
+	if (skb_can_coalesce(skb, next_frag, frag_page->page, frag_offset)) {
+		skb_coalesce_rx_frag(skb, next_frag - 1, len, truesize);
+	} else {
+		frag_page->frags++;
+		skb_add_rx_frag(skb, next_frag, frag_page->page,
+				frag_offset, len, truesize);
+	}
 }

 static inline void
@@ -1956,8 +1964,7 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
 		u32 pg_consumed_bytes = min_t(u32, PAGE_SIZE - data_offset, data_bcnt);
 		unsigned int truesize = pg_consumed_bytes;

-		frag_page->frags++;
-		mlx5e_add_skb_frag(rq, skb, frag_page->page, data_offset,
+		mlx5e_add_skb_frag(rq, skb, frag_page, data_offset,
 				   pg_consumed_bytes, truesize);

 		data_bcnt -= pg_consumed_bytes;