Merge branch 'mlxsw-events-processing-performance'

Petr Machata says: ==================== mlxsw: Improve events processing performance Amit Cohen writes: Spectrum ASICs only support a single interrupt, it means that all the events are handled by one IRQ (interrupt request) handler. Currently, we schedule a tasklet to handle events in EQ, then we also use tasklet for CQ, SDQ and RDQ. Tasklet runs in softIRQ (software IRQ) context, and will be run on the same CPU which scheduled it. It means that today we have one CPU which handles all the packets (both network packets and EMADs) from hardware. The existing implementation is not efficient and can be improved. Measuring latency of EMADs in the driver (without the time in FW) shows that latency is increased by factor of 28 (x28) when network traffic is handled by the driver. Measuring throughput in CPU shows that CPU can handle ~35% less packets of specific flow when corrupted packets are also handled by the driver. There are cases that these values even worse, we measure decrease of ~44% packet rate. This can be improved if network packet and EMADs will be handled in parallel by several CPUs, and more than that, if different types of traffic will be handled in parallel. We can achieve this using NAPI. This set converts the driver to process completions from hardware via NAPI. The idea is to add NAPI instance per CQ (which is mapped 1:1 to SDQ/RDQ), which means that each DQ can be handled separately. we have DQ for EMADs and DQs for each trap group (like LLDP, BGP, L3 drops, etc..). See more details in commit messages. An additional improvement which is done as part of this set is related to doorbells' ring. The idea is to handle small chunks of Rx packets (which is also recommended using NAPI) and ring doorbells once per chunk. This reduces the access to hardware which is expensive (time wise) and might take time because of memory barriers. With this set we can see better performance. To summerize: EMADs latency: +------------------------------------------------------------------------+ | | Before this set | Now | |------------------|---------------------------|-------------------------| | Increased factor | x28 | x1.5 | +------------------------------------------------------------------------+ Note that we can see even measurements that show better latency when traffic is handled by the driver. Throughput: +------------------------------------------------------------------------+ | | Before this set | Now | |-------------|----------------------------|-----------------------------| | Reduced | 35% | 6% | | packet rate | | | +------------------------------------------------------------------------+ Additional improvements are planned - use page pool for buffer allocations and avoid cache miss of each SKB using napi_build_skb(). Patch set overview: Patches #1-#2 improve access to hardware by reducing dorbells' rings Patch #3-#4 are preaparations for NAPI usage Patch #5 converts the driver to use NAPI ==================== Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-events-processing-performance'
Petr Machata says: ==================== mlxsw: Improve events processing performance Amit Cohen writes: Spectrum ASICs only support a single interrupt, it means that all the events are handled by one IRQ (interrupt request) handler. Currently, we schedule a tasklet to handle events in EQ, then we also use tasklet for CQ, SDQ and RDQ. Tasklet runs in softIRQ (software IRQ) context, and will be run on the same CPU which scheduled it. It means that today we have one CPU which handles all the packets (both network packets and EMADs) from hardware. The existing implementation is not efficient and can be improved. Measuring latency of EMADs in the driver (without the time in FW) shows that latency is increased by factor of 28 (x28) when network traffic is handled by the driver. Measuring throughput in CPU shows that CPU can handle ~35% less packets of specific flow when corrupted packets are also handled by the driver. There are cases that these values even worse, we measure decrease of ~44% packet rate. This can be improved if network packet and EMADs will be handled in parallel by several CPUs, and more than that, if different types of traffic will be handled in parallel. We can achieve this using NAPI. This set converts the driver to process completions from hardware via NAPI. The idea is to add NAPI instance per CQ (which is mapped 1:1 to SDQ/RDQ), which means that each DQ can be handled separately. we have DQ for EMADs and DQs for each trap group (like LLDP, BGP, L3 drops, etc..). See more details in commit messages. An additional improvement which is done as part of this set is related to doorbells' ring. The idea is to handle small chunks of Rx packets (which is also recommended using NAPI) and ring doorbells once per chunk. This reduces the access to hardware which is expensive (time wise) and might take time because of memory barriers. With this set we can see better performance. To summerize: EMADs latency: +------------------------------------------------------------------------+ | | Before this set | Now | |------------------|---------------------------|-------------------------| | Increased factor | x28 | x1.5 | +------------------------------------------------------------------------+ Note that we can see even measurements that show better latency when traffic is handled by the driver. Throughput: +------------------------------------------------------------------------+ | | Before this set | Now | |-------------|----------------------------|-----------------------------| | Reduced | 35% | 6% | | packet rate | | | +------------------------------------------------------------------------+ Additional improvements are planned - use page pool for buffer allocations and avoid cache miss of each SKB using napi_build_skb(). Patch set overview: Patches #1-#2 improve access to hardware by reducing dorbells' rings Patch #3-#4 are preaparations for NAPI usage Patch #5 converts the driver to use NAPI ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
fac87d32 · David S. Miller · b5327b9a · 3b0b3019 · fac87d32
Commit fac87d32 authored Apr 29, 2024 by David S. Miller
Show whitespace changes
Inline Side-by-side

Showing with 150 additions and 54 deletions

drivers/net/ethernet/mellanox/mlxsw/pci.c drivers/net/ethernet/mellanox/mlxsw/pci.c +150 -54

No files found.
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -82,12 +82,17 @@ struct mlxsw_pci_queue {
 	u8 num; /* queue number */
 	u8 elem_size; /* size of one element */
 	enum mlxsw_pci_queue_type type;
-	struct tasklet_struct tasklet; /* queue processing tasklet */
 	struct mlxsw_pci *pci;
+	union {
 		struct {
 			enum mlxsw_pci_cqe_v v;
 			struct mlxsw_pci_queue *dq;
+			struct napi_struct napi;
 		} cq;
+		struct {
+			struct tasklet_struct tasklet;
+		} eq;
+	} u;
 };

 struct mlxsw_pci_queue_type_group {
@@ -127,11 +132,40 @@ struct mlxsw_pci {
 	u8 num_cqs; /* Number of CQs */
 	u8 num_sdqs; /* Number of SDQs */
 	bool skip_reset;
+	struct net_device *napi_dev_tx;
+	struct net_device *napi_dev_rx;
 };

-static void mlxsw_pci_queue_tasklet_schedule(struct mlxsw_pci_queue *q)
+static int mlxsw_pci_napi_devs_init(struct mlxsw_pci *mlxsw_pci)
+{
+	int err;
+
+	mlxsw_pci->napi_dev_tx = alloc_netdev_dummy(0);
+	if (!mlxsw_pci->napi_dev_tx)
+		return -ENOMEM;
+	strscpy(mlxsw_pci->napi_dev_tx->name, "mlxsw_tx",
+		sizeof(mlxsw_pci->napi_dev_tx->name));
+
+	mlxsw_pci->napi_dev_rx = alloc_netdev_dummy(0);
+	if (!mlxsw_pci->napi_dev_rx) {
+		err = -ENOMEM;
+		goto err_alloc_rx;
+	}
+	strscpy(mlxsw_pci->napi_dev_rx->name, "mlxsw_rx",
+		sizeof(mlxsw_pci->napi_dev_rx->name));
+	dev_set_threaded(mlxsw_pci->napi_dev_rx, true);
+
+	return 0;
+
+err_alloc_rx:
+	free_netdev(mlxsw_pci->napi_dev_tx);
+	return err;
+}
+
+static void mlxsw_pci_napi_devs_fini(struct mlxsw_pci *mlxsw_pci)
 {
-	tasklet_schedule(&q->tasklet);
+	free_netdev(mlxsw_pci->napi_dev_rx);
+	free_netdev(mlxsw_pci->napi_dev_tx);
 }

 static char *__mlxsw_pci_queue_elem_get(struct mlxsw_pci_queue *q,
@@ -290,7 +324,7 @@ static int mlxsw_pci_sdq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
 		return err;

 	cq = mlxsw_pci_cq_get(mlxsw_pci, cq_num);
-	cq->cq.dq = q;
+	cq->u.cq.dq = q;
 	mlxsw_pci_queue_doorbell_producer_ring(mlxsw_pci, q);
 	return 0;
 }
@@ -399,7 +433,7 @@ static int mlxsw_pci_rdq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
 		return err;

 	cq = mlxsw_pci_cq_get(mlxsw_pci, cq_num);
-	cq->cq.dq = q;
+	cq->u.cq.dq = q;

 	mlxsw_pci_queue_doorbell_producer_ring(mlxsw_pci, q);

@@ -421,7 +455,7 @@ static int mlxsw_pci_rdq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
 		elem_info = mlxsw_pci_queue_elem_info_get(q, i);
 		mlxsw_pci_rdq_skb_free(mlxsw_pci, elem_info);
 	}
-	cq->cq.dq = NULL;
+	cq->u.cq.dq = NULL;
 	mlxsw_cmd_hw2sw_rdq(mlxsw_pci->core, q->num);

 	return err;
@@ -443,12 +477,12 @@ static void mlxsw_pci_rdq_fini(struct mlxsw_pci *mlxsw_pci,
 static void mlxsw_pci_cq_pre_init(struct mlxsw_pci *mlxsw_pci,
 				  struct mlxsw_pci_queue *q)
 {
-	q->cq.v = mlxsw_pci->max_cqe_ver;
+	q->u.cq.v = mlxsw_pci->max_cqe_ver;

-	if (q->cq.v == MLXSW_PCI_CQE_V2 &&
+	if (q->u.cq.v == MLXSW_PCI_CQE_V2 &&
 	    q->num < mlxsw_pci->num_sdqs &&
 	    !mlxsw_core_sdq_supports_cqe_v2(mlxsw_pci->core))
-		q->cq.v = MLXSW_PCI_CQE_V1;
+		q->u.cq.v = MLXSW_PCI_CQE_V1;
 }

 static unsigned int mlxsw_pci_read32_off(struct mlxsw_pci *mlxsw_pci,
@@ -630,9 +664,7 @@ static void mlxsw_pci_cqe_rdq_handle(struct mlxsw_pci *mlxsw_pci,
 	mlxsw_core_skb_receive(mlxsw_pci->core, skb, &rx_info);

 out:
-	/* Everything is set up, ring doorbell to pass elem to HW */
 	q->producer_counter++;
-	mlxsw_pci_queue_doorbell_producer_ring(mlxsw_pci, q);
 	return;
 }

@@ -644,7 +676,7 @@ static char *mlxsw_pci_cq_sw_cqe_get(struct mlxsw_pci_queue *q)

 	elem_info = mlxsw_pci_queue_elem_info_consumer_get(q);
 	elem = elem_info->elem;
-	owner_bit = mlxsw_pci_cqe_owner_get(q->cq.v, elem);
+	owner_bit = mlxsw_pci_cqe_owner_get(q->u.cq.v, elem);
 	if (mlxsw_pci_elem_hw_owned(q, owner_bit))
 		return NULL;
 	q->consumer_counter++;
@@ -652,20 +684,33 @@ static char *mlxsw_pci_cq_sw_cqe_get(struct mlxsw_pci_queue *q)
 	return elem;
 }

-static void mlxsw_pci_cq_rx_tasklet(struct tasklet_struct *t)
+static bool mlxsw_pci_cq_cqe_to_handle(struct mlxsw_pci_queue *q)
 {
-	struct mlxsw_pci_queue *q = from_tasklet(q, t, tasklet);
-	struct mlxsw_pci_queue *rdq = q->cq.dq;
+	struct mlxsw_pci_queue_elem_info *elem_info;
+	bool owner_bit;
+
+	elem_info = mlxsw_pci_queue_elem_info_consumer_get(q);
+	owner_bit = mlxsw_pci_cqe_owner_get(q->u.cq.v, elem_info->elem);
+	return !mlxsw_pci_elem_hw_owned(q, owner_bit);
+}
+
+static int mlxsw_pci_napi_poll_cq_rx(struct napi_struct *napi, int budget)
+{
+	struct mlxsw_pci_queue *q = container_of(napi, struct mlxsw_pci_queue,
+						 u.cq.napi);
+	struct mlxsw_pci_queue *rdq = q->u.cq.dq;
 	struct mlxsw_pci *mlxsw_pci = q->pci;
-	int credits = q->count >> 1;
-	int items = 0;
+	int work_done = 0;
 	char *cqe;

+	/* If the budget is 0, Rx processing should be skipped. */
+	if (unlikely(!budget))
+		return 0;
+
 	while ((cqe = mlxsw_pci_cq_sw_cqe_get(q))) {
 		u16 wqe_counter = mlxsw_pci_cqe_wqe_counter_get(cqe);
-		u8 sendq = mlxsw_pci_cqe_sr_get(q->cq.v, cqe);
-		u8 dqn = mlxsw_pci_cqe_dqn_get(q->cq.v, cqe);
-		char ncqe[MLXSW_PCI_CQE_SIZE_MAX];
+		u8 sendq = mlxsw_pci_cqe_sr_get(q->u.cq.v, cqe);
+		u8 dqn = mlxsw_pci_cqe_dqn_get(q->u.cq.v, cqe);

 		if (unlikely(sendq)) {
 			WARN_ON_ONCE(1);
@@ -677,32 +722,53 @@ static void mlxsw_pci_cq_rx_tasklet(struct tasklet_struct *t)
 			continue;
 		}

-		memcpy(ncqe, cqe, q->elem_size);
-		mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);
-
 		mlxsw_pci_cqe_rdq_handle(mlxsw_pci, rdq,
-					 wqe_counter, q->cq.v, ncqe);
+					 wqe_counter, q->u.cq.v, cqe);

-		if (++items == credits)
+		if (++work_done == budget)
 			break;
 	}

+	mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);
+	mlxsw_pci_queue_doorbell_producer_ring(mlxsw_pci, rdq);
+
+	if (work_done < budget)
+		goto processing_completed;
+
+	/* The driver still has outstanding work to do, budget was exhausted.
+	 * Return exactly budget. In that case, the NAPI instance will be polled
+	 * again.
+	 */
+	if (mlxsw_pci_cq_cqe_to_handle(q))
+		goto out;
+
+	/* The driver processed all the completions and handled exactly
+	 * 'budget'. Return 'budget - 1' to distinguish from the case that
+	 * driver still has completions to handle.
+	 */
+	if (work_done == budget)
+		work_done--;
+
+processing_completed:
+	if (napi_complete_done(napi, work_done))
 		mlxsw_pci_queue_doorbell_arm_consumer_ring(mlxsw_pci, q);
+out:
+	return work_done;
 }

-static void mlxsw_pci_cq_tx_tasklet(struct tasklet_struct *t)
+static int mlxsw_pci_napi_poll_cq_tx(struct napi_struct *napi, int budget)
 {
-	struct mlxsw_pci_queue *q = from_tasklet(q, t, tasklet);
-	struct mlxsw_pci_queue *sdq = q->cq.dq;
+	struct mlxsw_pci_queue *q = container_of(napi, struct mlxsw_pci_queue,
+						 u.cq.napi);
+	struct mlxsw_pci_queue *sdq = q->u.cq.dq;
 	struct mlxsw_pci *mlxsw_pci = q->pci;
-	int credits = q->count >> 1;
-	int items = 0;
+	int work_done = 0;
 	char *cqe;

 	while ((cqe = mlxsw_pci_cq_sw_cqe_get(q))) {
 		u16 wqe_counter = mlxsw_pci_cqe_wqe_counter_get(cqe);
-		u8 sendq = mlxsw_pci_cqe_sr_get(q->cq.v, cqe);
-		u8 dqn = mlxsw_pci_cqe_dqn_get(q->cq.v, cqe);
+		u8 sendq = mlxsw_pci_cqe_sr_get(q->u.cq.v, cqe);
+		u8 dqn = mlxsw_pci_cqe_dqn_get(q->u.cq.v, cqe);
 		char ncqe[MLXSW_PCI_CQE_SIZE_MAX];

 		if (unlikely(!sendq)) {
@@ -719,13 +785,23 @@ static void mlxsw_pci_cq_tx_tasklet(struct tasklet_struct *t)
 		mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);

 		mlxsw_pci_cqe_sdq_handle(mlxsw_pci, sdq,
-					 wqe_counter, q->cq.v, ncqe);
+					 wqe_counter, q->u.cq.v, ncqe);

-		if (++items == credits)
-			break;
+		work_done++;
 	}

+	/* If the budget is 0 napi_complete_done() should never be called. */
+	if (unlikely(!budget))
+		goto processing_completed;
+
+	work_done = min(work_done, budget - 1);
+	if (unlikely(!napi_complete_done(napi, work_done)))
+		goto out;
+
+processing_completed:
 	mlxsw_pci_queue_doorbell_arm_consumer_ring(mlxsw_pci, q);
+out:
+	return work_done;
 }

 static enum mlxsw_pci_cq_type
@@ -741,17 +817,29 @@ mlxsw_pci_cq_type(const struct mlxsw_pci *mlxsw_pci,
 	return MLXSW_PCI_CQ_RDQ;
 }

-static void mlxsw_pci_cq_tasklet_setup(struct mlxsw_pci_queue *q,
+static void mlxsw_pci_cq_napi_setup(struct mlxsw_pci_queue *q,
 				    enum mlxsw_pci_cq_type cq_type)
 {
+	struct mlxsw_pci *mlxsw_pci = q->pci;
+
 	switch (cq_type) {
 	case MLXSW_PCI_CQ_SDQ:
-		tasklet_setup(&q->tasklet, mlxsw_pci_cq_tx_tasklet);
+		netif_napi_add(mlxsw_pci->napi_dev_tx, &q->u.cq.napi,
+			       mlxsw_pci_napi_poll_cq_tx);
 		break;
 	case MLXSW_PCI_CQ_RDQ:
-		tasklet_setup(&q->tasklet, mlxsw_pci_cq_rx_tasklet);
+		netif_napi_add(mlxsw_pci->napi_dev_rx, &q->u.cq.napi,
+			       mlxsw_pci_napi_poll_cq_rx);
 		break;
 	}
+
+	napi_enable(&q->u.cq.napi);
+}
+
+static void mlxsw_pci_cq_napi_teardown(struct mlxsw_pci_queue *q)
+{
+	napi_disable(&q->u.cq.napi);
+	netif_napi_del(&q->u.cq.napi);
 }

 static int mlxsw_pci_cq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
@@ -765,13 +853,13 @@ static int mlxsw_pci_cq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
 	for (i = 0; i < q->count; i++) {
 		char *elem = mlxsw_pci_queue_elem_get(q, i);

-		mlxsw_pci_cqe_owner_set(q->cq.v, elem, 1);
+		mlxsw_pci_cqe_owner_set(q->u.cq.v, elem, 1);
 	}

-	if (q->cq.v == MLXSW_PCI_CQE_V1)
+	if (q->u.cq.v == MLXSW_PCI_CQE_V1)
 		mlxsw_cmd_mbox_sw2hw_cq_cqe_ver_set(mbox,
 				MLXSW_CMD_MBOX_SW2HW_CQ_CQE_VER_1);
-	else if (q->cq.v == MLXSW_PCI_CQE_V2)
+	else if (q->u.cq.v == MLXSW_PCI_CQE_V2)
 		mlxsw_cmd_mbox_sw2hw_cq_cqe_ver_set(mbox,
 				MLXSW_CMD_MBOX_SW2HW_CQ_CQE_VER_2);

@@ -786,7 +874,7 @@ static int mlxsw_pci_cq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
 	err = mlxsw_cmd_sw2hw_cq(mlxsw_pci->core, mbox, q->num);
 	if (err)
 		return err;
-	mlxsw_pci_cq_tasklet_setup(q, mlxsw_pci_cq_type(mlxsw_pci, q));
+	mlxsw_pci_cq_napi_setup(q, mlxsw_pci_cq_type(mlxsw_pci, q));
 	mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);
 	mlxsw_pci_queue_doorbell_arm_consumer_ring(mlxsw_pci, q);
 	return 0;
@@ -795,18 +883,19 @@ static int mlxsw_pci_cq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
 static void mlxsw_pci_cq_fini(struct mlxsw_pci *mlxsw_pci,
 			      struct mlxsw_pci_queue *q)
 {
+	mlxsw_pci_cq_napi_teardown(q);
 	mlxsw_cmd_hw2sw_cq(mlxsw_pci->core, q->num);
 }

 static u16 mlxsw_pci_cq_elem_count(const struct mlxsw_pci_queue *q)
 {
-	return q->cq.v == MLXSW_PCI_CQE_V2 ? MLXSW_PCI_CQE2_COUNT :
+	return q->u.cq.v == MLXSW_PCI_CQE_V2 ? MLXSW_PCI_CQE2_COUNT :
 					     MLXSW_PCI_CQE01_COUNT;
 }

 static u8 mlxsw_pci_cq_elem_size(const struct mlxsw_pci_queue *q)
 {
-	return q->cq.v == MLXSW_PCI_CQE_V2 ? MLXSW_PCI_CQE2_SIZE :
+	return q->u.cq.v == MLXSW_PCI_CQE_V2 ? MLXSW_PCI_CQE2_SIZE :
 					       MLXSW_PCI_CQE01_SIZE;
 }

@@ -829,7 +918,7 @@ static char *mlxsw_pci_eq_sw_eqe_get(struct mlxsw_pci_queue *q)
 static void mlxsw_pci_eq_tasklet(struct tasklet_struct *t)
 {
 	unsigned long active_cqns[BITS_TO_LONGS(MLXSW_PCI_CQS_MAX)];
-	struct mlxsw_pci_queue *q = from_tasklet(q, t, tasklet);
+	struct mlxsw_pci_queue *q = from_tasklet(q, t, u.eq.tasklet);
 	struct mlxsw_pci *mlxsw_pci = q->pci;
 	int credits = q->count >> 1;
 	u8 cqn, cq_count;
@@ -855,7 +944,7 @@ static void mlxsw_pci_eq_tasklet(struct tasklet_struct *t)
 	cq_count = mlxsw_pci->num_cqs;
 	for_each_set_bit(cqn, active_cqns, cq_count) {
 		q = mlxsw_pci_cq_get(mlxsw_pci, cqn);
-		mlxsw_pci_queue_tasklet_schedule(q);
+		napi_schedule(&q->u.cq.napi);
 	}
 }

@@ -891,7 +980,7 @@ static int mlxsw_pci_eq_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
 	err = mlxsw_cmd_sw2hw_eq(mlxsw_pci->core, mbox, q->num);
 	if (err)
 		return err;
-	tasklet_setup(&q->tasklet, mlxsw_pci_eq_tasklet);
+	tasklet_setup(&q->u.eq.tasklet, mlxsw_pci_eq_tasklet);
 	mlxsw_pci_queue_doorbell_consumer_ring(mlxsw_pci, q);
 	mlxsw_pci_queue_doorbell_arm_consumer_ring(mlxsw_pci, q);
 	return 0;
@@ -1452,7 +1541,7 @@ static irqreturn_t mlxsw_pci_eq_irq_handler(int irq, void *dev_id)
 	struct mlxsw_pci_queue *q;

 	q = mlxsw_pci_eq_get(mlxsw_pci);
-	mlxsw_pci_queue_tasklet_schedule(q);
+	tasklet_schedule(&q->u.eq.tasklet);
 	return IRQ_HANDLED;
 }

@@ -1724,6 +1813,10 @@ static int mlxsw_pci_init(void *bus_priv, struct mlxsw_core *mlxsw_core,
 	if (err)
 		goto err_requery_resources;

+	err = mlxsw_pci_napi_devs_init(mlxsw_pci);
+	if (err)
+		goto err_napi_devs_init;
+
 	err = mlxsw_pci_aqs_init(mlxsw_pci, mbox);
 	if (err)
 		goto err_aqs_init;
@@ -1741,6 +1834,8 @@ static int mlxsw_pci_init(void *bus_priv, struct mlxsw_core *mlxsw_core,
 err_request_eq_irq:
 	mlxsw_pci_aqs_fini(mlxsw_pci);
 err_aqs_init:
+	mlxsw_pci_napi_devs_fini(mlxsw_pci);
+err_napi_devs_init:
 err_requery_resources:
 err_config_profile:
 err_cqe_v_check:
@@ -1768,6 +1863,7 @@ static void mlxsw_pci_fini(void *bus_priv)

 	free_irq(pci_irq_vector(mlxsw_pci->pdev, 0), mlxsw_pci);
 	mlxsw_pci_aqs_fini(mlxsw_pci);
+	mlxsw_pci_napi_devs_fini(mlxsw_pci);
 	mlxsw_pci_fw_area_fini(mlxsw_pci);
 	mlxsw_pci_free_irq_vectors(mlxsw_pci);
 }