Commit 61552d2c authored by David S. Miller

Merge branch 'net-batched-receive-in-GRO-path'

Edward Cree says:

====================
net: batched receive in GRO path

This series listifies part of GRO processing, in a manner which allows those
 packets which are not GROed (i.e. for which dev_gro_receive returns
 GRO_NORMAL) to be passed on to the listified regular receive path.
dev_gro_receive() itself is not listified, nor the per-protocol GRO
 callback, since GRO's need to hold packets on lists under napi->gro_hash
 makes keeping the packets on other lists awkward, and since the GRO control
 block state of held skbs can refer only to one 'new' skb at a time.
Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb
 onto a list in the napi struct; that list is then passed up the regular
 receive path at the end of the napi poll, or as soon as its length exceeds
 the (new) sysctl net.core.gro_normal_batch.
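
In outline, the mechanism reduces to the two small helpers added to
 net/core/dev.c by this series (condensed from the diff below):
 napi_frags_finish() feeds each GRO_NORMAL skb to gro_normal_one(), and the
 napi poll and busy-poll paths flush the batch with gro_normal_list():

        /* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
        static void gro_normal_list(struct napi_struct *napi)
        {
                if (!napi->rx_count)
                        return;
                netif_receive_skb_list_internal(&napi->rx_list);
                INIT_LIST_HEAD(&napi->rx_list);
                napi->rx_count = 0;
        }

        /* Queue one GRO_NORMAL SKB up for list processing.  If batch size
         * exceeded, pass the whole batch up to the stack.
         */
        static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
        {
                list_add_tail(&skb->list, &napi->rx_list);
                if (++napi->rx_count >= gro_normal_batch)
                        gro_normal_list(napi);
        }

gro_normal_list() is also called from napi_complete_done(), napi_poll() and
 the busy-poll paths (see the diff), so nothing is left stranded on rx_list
 when a poll ends.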

Performance figures with this series were collected on a back-to-back pair of
 Solarflare sfn8522-r2 NICs, using 120-second NetPerf tests.  In the stats,
 sample size n for old and new code is 6 runs each; p is from a Welch t-test.
Tests were run both with GRO enabled and disabled, the latter simulating
 uncoalesceable packets (e.g. due to IP or TCP options).  The receive side
 (which was the device under test) had the NetPerf process pinned to one CPU,
 and the device interrupts pinned to a second CPU.  CPU utilisation figures
 (used in cases of line-rate performance) are summed across all CPUs.
net.core.gro_normal_batch was left at its default value of 8.

TCP 4 streams, GRO on: all results line rate (9.415Gbps)
net-next: 210.3% cpu
after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
after #3: 196.7% cpu (- 8.4%, p=0.136 vs net-next)
TCP 4 streams, GRO off:
net-next: 8.017 Gbps
after #1: 7.785 Gbps (- 2.9%, p=0.385 vs net-next)
after #3: 7.604 Gbps (- 5.1%, p=0.282 vs net-next.  But note *)
TCP 1 stream, GRO off:
net-next: 6.553 Gbps
after #1: 6.444 Gbps (- 1.7%, p=0.302 vs net-next)
after #3: 6.790 Gbps (+ 3.6%, p=0.169 vs net-next)
TCP 1 stream, GRO on, busy_read = 50: all results line rate
net-next: 156.0% cpu
after #1: 174.5% cpu (+11.9%, p=0.015 vs net-next)
after #3: 165.0% cpu (+ 5.8%, p=0.147 vs net-next)
TCP 1 stream, GRO off, busy_read = 50:
net-next: 6.488 Gbps
after #1: 6.625 Gbps (+ 2.1%, p=0.059 vs net-next)
after #3: 7.351 Gbps (+13.3%, p=0.026 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload:
net-next: 995.083 us
after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload, busy_read = 50:
net-next:   2.851 ms
after #1:   2.871 ms (+ 0.7%, p=0.134 vs net-next)
after #3:   2.937 ms (+ 3.0%, p<0.001 vs net-next)
TCP_RR 100 streams, GRO off, 1 byte payload, busy_read = 50:
net-next: 867.317 us
after #1: 865.717 us (- 0.2%, p=0.334 vs net-next)
after #3: 868.517 us (+ 0.1%, p=0.414 vs net-next)

(*) These tests produced a mixture of line-rate and below-line-rate results,
 meaning that statistically speaking the results were 'censored' by the
 upper bound, and were thus not normally distributed, making a Welch t-test
 mathematically invalid.  I therefore also calculated estimators according
 to [1], which gave the following:
net-next: 8.133 Gbps
after #1: 8.130 Gbps (- 0.0%, p=0.499 vs net-next)
after #3: 7.680 Gbps (- 5.6%, p=0.285 vs net-next)
(though my procedure for determining ν wasn't mathematically well-founded
 either, so take that p-value with a grain of salt).
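
(For context, given per-branch sample means m_i, variances s_i^2 and sizes
 n_i, Welch's statistic and the usual Welch-Satterthwaite approximation for
 its degrees of freedom ν are

        t = (m_1 - m_2) / sqrt(s_1^2/n_1 + s_2^2/n_2)
        ν ≈ (s_1^2/n_1 + s_2^2/n_2)^2
            / ((s_1^2/n_1)^2/(n_1 - 1) + (s_2^2/n_2)^2/(n_2 - 1))

 this ν being the quantity whose determination the censoring complicates.)
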
A further check came from dividing the bandwidth figure by the CPU usage for
 each test run, giving:
net-next: 3.461
after #1: 3.198 (- 7.6%, p=0.145 vs net-next)
after #3: 3.641 (+ 5.2%, p=0.280 vs net-next)

The above results are fairly mixed, and in most cases not statistically
 significant.  But I think we can roughly conclude that the series
 marginally improves non-GROable throughput, without hurting latency
 (except in the large-payload busy-polling case, which in any case yields
 horrid performance even on net-next, at almost triple the latency without
 busy-poll).  Also, drivers which, unlike sfc, pass UDP traffic to GRO
 should see a benefit from gaining access to batching.

Changed in v3:
 * gro_normal_batch sysctl now uses SYSCTL_ONE instead of &one
 * removed RFC tags (no comments after a week means no-one objects, right?)

Changed in v2:
 * During busy poll, call gro_normal_list() to receive batched packets
   after each cycle of the napi busy loop.  See comments in Patch #3 for
   complications of doing the same in busy_poll_stop().

[1]: Cohen 1959, doi: 10.1080/00401706.1959.10489859
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
parents 5e6d9fc7 323ebb61
@@ -424,7 +424,6 @@ ef4_rx_packet_gro(struct ef4_channel *channel, struct ef4_rx_buffer *rx_buf,
                   unsigned int n_frags, u8 *eh)
 {
         struct napi_struct *napi = &channel->napi_str;
-        gro_result_t gro_result;
         struct ef4_nic *efx = channel->efx;
         struct sk_buff *skb;
 
@@ -460,9 +459,7 @@ ef4_rx_packet_gro(struct ef4_channel *channel, struct ef4_rx_buffer *rx_buf,
 
         skb_record_rx_queue(skb, channel->rx_queue.core_index);
 
-        gro_result = napi_gro_frags(napi);
-        if (gro_result != GRO_DROP)
-                channel->irq_mod_score += 2;
+        napi_gro_frags(napi);
 }
 
 /* Allocate and construct an SKB around page fragments */

@@ -412,7 +412,6 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
                   unsigned int n_frags, u8 *eh)
 {
         struct napi_struct *napi = &channel->napi_str;
-        gro_result_t gro_result;
         struct efx_nic *efx = channel->efx;
         struct sk_buff *skb;
 
@@ -449,9 +448,7 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
 
         skb_record_rx_queue(skb, channel->rx_queue.core_index);
 
-        gro_result = napi_gro_frags(napi);
-        if (gro_result != GRO_DROP)
-                channel->irq_mod_score += 2;
+        napi_gro_frags(napi);
 }
 
 /* Allocate and construct an SKB around page fragments */

@@ -332,6 +332,8 @@ struct napi_struct {
         struct net_device *dev;
         struct gro_list gro_hash[GRO_HASH_BUCKETS];
         struct sk_buff *skb;
+        struct list_head rx_list; /* Pending GRO_NORMAL skbs */
+        int rx_count; /* length of rx_list */
         struct hrtimer timer;
         struct list_head dev_list;
         struct hlist_node napi_hash_node;
@@ -4239,6 +4241,7 @@ extern int dev_weight_rx_bias;
 extern int dev_weight_tx_bias;
 extern int dev_rx_weight;
 extern int dev_tx_weight;
+extern int gro_normal_batch;
 
 bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
 struct net_device *netdev_upper_get_next_dev_rcu(struct net_device *dev,

@@ -3963,6 +3963,8 @@ int dev_weight_rx_bias __read_mostly = 1;  /* bias for backlog weight */
 int dev_weight_tx_bias __read_mostly = 1;  /* bias for output_queue quota */
 int dev_rx_weight __read_mostly = 64;
 int dev_tx_weight __read_mostly = 64;
+/* Maximum number of GRO_NORMAL skbs to batch up for list-RX */
+int gro_normal_batch __read_mostly = 8;
 
 /* Called with irq disabled */
 static inline void ____napi_schedule(struct softnet_data *sd,
@@ -5747,6 +5749,26 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(napi_get_frags);
 
+/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
+static void gro_normal_list(struct napi_struct *napi)
+{
+        if (!napi->rx_count)
+                return;
+        netif_receive_skb_list_internal(&napi->rx_list);
+        INIT_LIST_HEAD(&napi->rx_list);
+        napi->rx_count = 0;
+}
+
+/* Queue one GRO_NORMAL SKB up for list processing. If batch size exceeded,
+ * pass the whole batch up to the stack.
+ */
+static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
+{
+        list_add_tail(&skb->list, &napi->rx_list);
+        if (++napi->rx_count >= gro_normal_batch)
+                gro_normal_list(napi);
+}
+
 static gro_result_t napi_frags_finish(struct napi_struct *napi,
                                       struct sk_buff *skb,
                                       gro_result_t ret)
@@ -5756,8 +5778,8 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi,
         case GRO_HELD:
                 __skb_push(skb, ETH_HLEN);
                 skb->protocol = eth_type_trans(skb, skb->dev);
-                if (ret == GRO_NORMAL && netif_receive_skb_internal(skb))
-                        ret = GRO_DROP;
+                if (ret == GRO_NORMAL)
+                        gro_normal_one(napi, skb);
                 break;
 
         case GRO_DROP:
@@ -6034,6 +6056,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
                                  NAPIF_STATE_IN_BUSY_POLL)))
                 return false;
 
+        gro_normal_list(n);
+
         if (n->gro_bitmask) {
                 unsigned long timeout = 0;
 
@@ -6119,10 +6143,19 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
          * Ideally, a new ndo_busy_poll_stop() could avoid another round.
          */
         rc = napi->poll(napi, BUSY_POLL_BUDGET);
+        /* We can't gro_normal_list() here, because napi->poll() might have
+         * rearmed the napi (napi_complete_done()) in which case it could
+         * already be running on another CPU.
+         */
         trace_napi_poll(napi, rc, BUSY_POLL_BUDGET);
         netpoll_poll_unlock(have_poll_lock);
-        if (rc == BUSY_POLL_BUDGET)
+        if (rc == BUSY_POLL_BUDGET) {
+                /* As the whole budget was spent, we still own the napi so can
+                 * safely handle the rx_list.
+                 */
+                gro_normal_list(napi);
                 __napi_schedule(napi);
+        }
         local_bh_enable();
 }
 
@@ -6167,6 +6200,7 @@ void napi_busy_loop(unsigned int napi_id,
                 }
                 work = napi_poll(napi, BUSY_POLL_BUDGET);
                 trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
+                gro_normal_list(napi);
 count:
                 if (work > 0)
                         __NET_ADD_STATS(dev_net(napi->dev),
@@ -6272,6 +6306,8 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
         napi->timer.function = napi_watchdog;
         init_gro_hash(napi);
         napi->skb = NULL;
+        INIT_LIST_HEAD(&napi->rx_list);
+        napi->rx_count = 0;
         napi->poll = poll;
         if (weight > NAPI_POLL_WEIGHT)
                 netdev_err_once(dev, "%s() called with weight %d\n", __func__,
@@ -6368,6 +6404,8 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
                 goto out_unlock;
         }
 
+        gro_normal_list(n);
+
         if (n->gro_bitmask) {
                 /* flush too old packets
                  * If HZ < 1000, flush all packets.

@@ -567,6 +567,14 @@ static struct ctl_table net_core_table[] = {
                 .mode = 0644,
                 .proc_handler = proc_do_static_key,
         },
+        {
+                .procname = "gro_normal_batch",
+                .data = &gro_normal_batch,
+                .maxlen = sizeof(unsigned int),
+                .mode = 0644,
+                .proc_handler = proc_dointvec_minmax,
+                .extra1 = SYSCTL_ONE,
+        },
         { }
 };
 