Commits · 2388ba36e94594406a755aceafc5983c289e68bf · Kirill Smelkov / linux

31 Oct, 2019 40 commits

dpaa_eth: cleanup skb_to_contig_fd() · 2388ba36

Madalin Bucur authored Oct 31, 2019

Remove cast, align variable name, simplify DMA map size computation.
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


dpaa_eth: use fd information in dpaa_cleanup_tx_fd() · 7689d82c

Madalin Bucur authored Oct 31, 2019

Instead of reading skb fields, use information from the DPAA frame
descriptor.
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


dpaa_eth: simplify variables used in dpaa_cleanup_tx_fd() · ae1512fb

Madalin Bucur authored Oct 31, 2019

Avoid casts and repeated conversions.
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


dpaa_eth: avoid timestamp read on error paths · 9a4f4f3a

Madalin Bucur authored Oct 31, 2019

The dpaa_cleanup_tx_fd() function is called by the frame transmit
confirmation callback but also on several error paths. This function
reads the transmit timestamp value, which is not valid on the error
paths. Avoid reading it there.

Fixes: 4664856e ("dpaa_eth: add support for hardware timestamping")
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
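
As an illustration of the idea in this commit, here is a minimal
userspace sketch, not the driver's code. The struct, the function name
and the 'ts' flag are hypothetical stand-ins: the caller tells the
cleanup helper whether it is running on the confirmation path, and the
timestamp is only read in that case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a frame descriptor; not the real struct qm_fd. */
struct demo_fd {
	uint64_t hw_timestamp;	/* only meaningful after a successful transmit */
};

/*
 * Hypothetical cleanup helper. 'ts' is true on the transmit confirmation
 * path and false on error paths. The timestamp is copied to 'out' only
 * when it is valid to read it; the return value says whether it was.
 */
static bool demo_cleanup_tx_fd(const struct demo_fd *fd, bool ts, uint64_t *out)
{
	/* ... buffer unmapping and resource release would happen here ... */

	if (!ts)
		return false;	/* error path: leave 'out' untouched */

	*out = fd->hw_timestamp;
	return true;
}
```

An error-path caller would pass false and ignore 'out'; only the
confirmation callback would pass true.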


dpaa_eth: perform DMA unmapping before read · c70fd318

Madalin Bucur authored Oct 31, 2019

DMA unmapping is required before accessing the HW provided timestamping
information.

Fixes: 4664856e ("dpaa_eth: add support for hardware timestamping")
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


dpaa_eth: use page backed rx buffers · 8151ee88

Madalin Bucur authored Oct 31, 2019

Change the buffers used for reception from netdev_frags to pages.
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


dpaa_eth: use only one buffer pool per interface · f07f3004

Madalin Bucur authored Oct 31, 2019

Currently the DPAA Ethernet driver uses three buffer pools
for each interface, with three different sizes for the buffers
provided for the FMan reception path. This patch reduces the
number of buffer pools to one per interface. This change is in
preparation for another that will switch the receive path from
netdev_frags to page backed buffers.
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Merge branch 'net-hns3-add-some-optimizations-and-cleanups' · 2bd7c3e1

David S. Miller authored Oct 31, 2019

Huazhong Tan says:

====================
net: hns3: add some optimizations and cleanups

This series adds some code optimizations and cleanups for
the HNS3 ethernet driver.

[patch 1/9] dumps some debug information when reset fails.

[patch 2/9] dumps some struct netdev_queue information on
TX timeout.

[patch 3/9] cleans up some magic numbers.

[patch 4/9] cleans up some coding style issues.

[patch 5/9] fixes a compiler warning.

[patch 6/9] optimizes some local variable initialization.

[patch 7/9] modifies some comments.

[patch 8/9] cleans up some print format warnings.

[patch 9/9] cleans up byte order issues.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>


net: hns3: cleanup byte order issues when printed · 39edaf24

Guojia Liao authored Oct 31, 2019

Though the hip08 and the IMP (Intelligent Management Processor)
have the same byte order right now, it is better to convert
__be or __le variables to the CPU's byte order before printing.
Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
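
A minimal userspace sketch of the same habit, under stated assumptions:
the helper names are hypothetical, and the conversion is written by hand
instead of using the kernel's le16_to_cpu(). The point is that the value
is decoded from its wire byte order before it reaches the format string,
so the output is the same on little-endian and big-endian CPUs.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Decode a 16-bit little-endian wire value into CPU byte order.
 * Hypothetical userspace stand-in for the kernel's le16_to_cpu().
 */
static uint16_t demo_le16_to_cpu(const uint8_t raw[2])
{
	return (uint16_t)(raw[0] | (raw[1] << 8));
}

/* Format the decoded value; the field name is made up for illustration. */
static int demo_format_field(char *buf, size_t len, const uint8_t raw[2])
{
	return snprintf(buf, len, "field: %u",
			(unsigned int)demo_le16_to_cpu(raw));
}
```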


net: hns3: cleanup some print format warning · adcf738b

Guojia Liao authored Oct 31, 2019

Using '%d' to print an unsigned int or '%u' to print an int
causes static analysis tools to emit warnings, so this patch
cleans up these warnings by using the format specifier that
matches the type of each variable.

Also modifies the type of some variables and macros to
match their usage.
Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
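
A short sketch of the rule this commit applies, with hypothetical names
and values: each argument gets the specifier that matches its declared
type, '%u' for unsigned int and '%d' for int.

```c
#include <stdio.h>

/*
 * Hypothetical example: 'queues' is unsigned, 'delta' is signed, and
 * each is printed with the specifier that matches its type.
 */
static int demo_format_counts(char *buf, size_t len,
			      unsigned int queues, int delta)
{
	return snprintf(buf, len, "queues=%u delta=%d", queues, delta);
}
```

With mismatched specifiers the same call could print a large unsigned
value as a negative number, or the reverse.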


net: hns3: add or modify some comments · 9e690456

Guangbin Huang authored Oct 31, 2019

This patch corrects the comment for macro HCLGE_MBX_GET_VF_FLR_STATUS,
and adds comments in some places to make the code more readable.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


net: hns3: optimize local variable initialization · 0bfdf286

Guangbin Huang authored Oct 31, 2019

The variable tx_ring does not need to be initialized, as it is always
set before use, and the variable rst_cnt is better initialized at its
declaration for simplicity.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


net: hns3: cleanup a format-truncation warning · e4b806ed

Guojia Liao authored Oct 31, 2019

In hns3_nic_init_irq(), when '*_int_idx' has more than 9 digits
and the length of netdev's name is IFNAMSIZ, the total length
of the final name will be larger than HNAE3_INT_NAME_LEN - 1.
Although '*_int_idx' will never have such a large value, the
compiler gives a format-truncation warning for this case.

So this patch just enlarges the buffer to avoid the warning.
Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
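
The kind of check the compiler performs here can be sketched in
userspace as follows. The buffer size, the name layout and the helper
are assumptions for illustration, not the driver's actual values:
snprintf() reports how long the full string would have been, so
truncation can be detected by comparing that length with the buffer
size.

```c
#include <stdio.h>

#define DEMO_INT_NAME_LEN 32	/* hypothetical buffer size for this sketch */

/*
 * Build "<netdev>-<kind>-<idx>" into 'out'.
 * Returns 1 if the whole name fit, 0 if it was truncated or on error.
 */
static int demo_build_irq_name(char *out, size_t len, const char *netdev,
			       const char *kind, int idx)
{
	int n = snprintf(out, len, "%s-%s-%d", netdev, kind, idx);

	return n >= 0 && (size_t)n < len;
}
```

Sizing the buffer for the worst case the format string allows, as the
patch does, silences the warning without changing runtime behavior.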


net: hns3: cleanup some coding style issues · db4d3d55

Guangbin Huang authored Oct 31, 2019

To unify code style and make code simpler, this patch modifies
some code, deletes unnecessary blank lines and {}, changes
location of code, and so on.

No functional change.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


net: hns3: cleanup some magic numbers · d6ad7c53

Guojia Liao authored Oct 31, 2019

To make the code more readable, this patch replaces
some magic numbers with macros or the sizeof operator.

Also uses the lower_32_bits and upper_32_bits macros to
get bits 0-31 and 32-63 of a number, instead of
using type conversion and the '>>' operation.

No functional change.
Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
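
For readers unfamiliar with these helpers, here is a userspace sketch
modeled on the kernel's lower_32_bits() and upper_32_bits() macros. The
demo_ prefix, the descriptor struct and its field names are hypothetical
and exist only for this example.

```c
#include <stdint.h>

/* Userspace equivalents of the kernel macros, renamed for this sketch. */
#define demo_lower_32_bits(n) ((uint32_t)((n) & 0xffffffff))
/* Two 16-bit shifts avoid an undefined 32-bit shift on 32-bit types. */
#define demo_upper_32_bits(n) ((uint32_t)(((n) >> 16) >> 16))

/* Hypothetical descriptor that stores a 64-bit address as two halves. */
struct demo_desc {
	uint32_t addr_lo;
	uint32_t addr_hi;
};

static void demo_set_addr(struct demo_desc *d, uint64_t dma_addr)
{
	d->addr_lo = demo_lower_32_bits(dma_addr);	/* bits 0-31 */
	d->addr_hi = demo_upper_32_bits(dma_addr);	/* bits 32-63 */
}
```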


net: hns3: add struct netdev_queue debug info for TX timeout · 647522a5

Yunsheng Lin authored Oct 31, 2019

When there is a TX timeout, we can tell whether the driver or the
stack has stopped the queue by looking at the state field, and when
the last packet was transmitted by looking at the trans_start field.

So this patch prints these two fields in
hns3_get_tx_timeo_queue_info().
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


net: hns3: dump some debug information when reset fail · 3d77d0cb

Huazhong Tan authored Oct 31, 2019

When reset fails, there is some information that helps with
finding out why it failed, so this patch dumps it. It also removes
an unused core_rst_cnt field in struct hclge_rst_stats.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Merge branch 'bnxt_en-Add-OP-TEE-based-bnxt-f-w-manager' · 79697744

David S. Miller authored Oct 31, 2019

Sheetal Tigadoli says:

====================
bnxt_en: Add OP-TEE based bnxt f/w manager

This patch series adds support for TEE based BNXT firmware
management module and the driver changes to invoke OP-TEE
APIs to fastboot firmware and to collect crash dump.

Changes from v4:
 - update Kconfig to reflect dependency on TEE driver
====================
Signed-off-by: David S. Miller <davem@davemloft.net>


bnxt_en: Add support to collect crash dump via ethtool · 0b0eacf3

Vasundhara Volam authored Oct 31, 2019

Driver supports 2 types of core dumps.

1. Live dump - Firmware dump when system is up and running.
2. Crash dump - Dump which is collected during firmware crash
                that can be retrieved after recovery.
Crash dump is currently supported only on specific 58800 chips,
where it can be retrieved only through the OP-TEE API, as firmware
cannot access this region directly.

User needs to set the dump flag using following command before
initiating the dump collection:

    $ ethtool -W|--set-dump eth0 N

Where N is "0" for live dump and "1" for crash dump

Command to collect the dump after setting the flag:

    $ ethtool -w eth0 data Filename

v3: Modify set_dump to support even when CONFIG_TEE_BNXT_FW=n.
Also change log message to netdev_info().

Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Sheetal Tigadoli <sheetal.tigadoli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


bnxt_en: Add support to invoke OP-TEE API to reset firmware · e07ab202

Vasundhara Volam authored Oct 31, 2019

During error recovery, when firmware indicates that it is
completely down, initiate a firmware reset by calling the OP-TEE API.

Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Sheetal Tigadoli <sheetal.tigadoli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


firmware: broadcom: add OP-TEE based BNXT f/w manager · 24688095

Vikas Gupta authored Oct 31, 2019

This driver registers on the TEE bus to interact with OP-TEE based
BNXT firmware management modules.

Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Sheetal Tigadoli <sheetal.tigadoli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Merge branch 'mlxsw-Make-port-split-code-more-generic' · 8c933eab

David S. Miller authored Oct 31, 2019

Ido Schimmel says:

====================
mlxsw: Make port split code more generic

Jiri says:

Currently, we assume some limitations and constant values which are not
applicable to Spectrum-3, which has 8-lane ports (instead of the
previous 4 lanes).

This patch does 2 things:

1) Generalizes the code to not use constants so it can work for 4, 8 and
   possibly 16 lanes.

2) Enforces some assumptions we had in the code but did not check.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Generalize split count check · 973b7fdb

Jiri Pirko authored Oct 31, 2019

Make the check generic for any possible value, not only 2 and 4.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
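
One way such a generic check could look is sketched below. This is an
assumption for illustration, not the driver's actual condition: the
function names are hypothetical, and the rule used here is that a valid
split count is a power of two greater than 1 that does not exceed the
port's maximal width.

```c
#include <stdbool.h>

/* True when n has exactly one bit set. */
static bool demo_is_power_of_2(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical generic check: accept 2, 4, 8, ... up to max_width
 * instead of comparing against the constants 2 and 4.
 */
static bool demo_split_count_valid(unsigned int count, unsigned int max_width)
{
	return count > 1 && count <= max_width && demo_is_power_of_2(count);
}
```

With max_width 8 this accepts 2, 4 and 8 and rejects 1, 3 and 16; with
max_width 1 it rejects every count.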


mlxsw: spectrum: Iterate over all ports in gap during unsplit create · fbbeea31

Jiri Pirko authored Oct 31, 2019

During recreation of the original unsplit ports, simply iterate over
the whole gap and recreate whatever originally existed.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Fix base port get for split count 4 and 8 · c3a64b51

Jiri Pirko authored Oct 31, 2019

The current code considers only split by 2 or 4. Make the base port
getting generic and allow split by 8 to be handled correctly. Generalize
the used port checks as well.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Use port_module_max_width to compute base port index · 013da297

Jiri Pirko authored Oct 31, 2019

Instead of using constant value, use port_module_max_width which is
aligned with the cluster size.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Remember split base local port and use it in unsplit · 49185277

Jiri Pirko authored Oct 31, 2019

Don't compute the original base local port during unsplit, rather
remember it in mlxsw_sp_port structure during split port creation.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Introduce resource for getting offset of 4 lanes split port · 038784a9

Jiri Pirko authored Oct 31, 2019

In Spectrum-3 the modules have 8 lanes, so split by count 2 results in
two split ports each of 4 lanes. Add a resource that can be used to
obtain local port offset in that case.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Push getting offsets of split ports into a helper · d0846ce9

Jiri Pirko authored Oct 31, 2019

Get the local port offsets of a split port in a separate helper
function and use it in both the split and unsplit functions.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Add sanity checks into module info get · c8fc10dc

Jiri Pirko authored Oct 31, 2019

Driver assumes certain values in the PMLP register. Add checks that
verify that PMLP register provides fitting values.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Pass mapping values in port mapping structure · 35896d96

Jiri Pirko authored Oct 31, 2019

Pass the port mapping structure down to create, module_map and other
functions instead of individual values.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Use mapping of port being split for creating split ports · 7b39fa5b

Jiri Pirko authored Oct 31, 2019

Don't use a constant max width value; use the actual width of the
port instead. Also don't pass the module value; use the value
stored in the same structure.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Replace port_to_module array with array of structs · 4a7f970f

Jiri Pirko authored Oct 31, 2019

Store the initial PMLP register configuration into array of structures
instead of just simple array of module numbers.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Distinguish between unsplittable and split port · 26a6befa

Jiri Pirko authored Oct 31, 2019

Currently, when a split request fails, the user cannot tell whether
the port cannot be split because it is already split, or because it
cannot be split at all. Add another check for the split flag to
distinguish the two cases. Also add a check forbidding split when the
maximal width is 1.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Move max_width check up before count check · 2e6a2d7b

Jiri Pirko authored Oct 31, 2019

The fact that the port cannot be split further should be checked before
checking the count, so move it.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Use PMTM register to get max module width · 25911e1b

Jiri Pirko authored Oct 31, 2019

Currently the max module width is hard-coded according to ASIC type.
That is not entirely correct, as the max module width might differ
per-board. Use PMTM register to query FW for maximal width of a module.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: reg: Add Port Module Type Mapping Register · a513b1a5

Jiri Pirko authored Oct 31, 2019

The PMTM allows query or configuration of module types.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: reg: Extend PMLP tx/rx lane value size to 4 bits · 94e76837

Jiri Pirko authored Oct 31, 2019

The tx/rx lane fields got extended to 4 bits, update the reg field
description accordingly.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


cxgb4/l2t: Simplify 't4_l2e_free()' and '_t4_l2e_free()' · d74361dc

Christophe JAILLET authored Oct 31, 2019

Use '__skb_queue_purge()' instead of re-implementing it.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>


Merge branch 'Control-action-percpu-counters-allocation-by-netlink-flag' · d86784fe

David S. Miller authored Oct 30, 2019

Vlad Buslov says:

====================
Control action percpu counters allocation by netlink flag

Currently, a significant fraction of CPU time during TC filter
allocation is spent in the percpu allocator. Moreover, the percpu
allocator is protected by a single global mutex, which negates any
potential to improve its performance through the recent changes to the
TC filter update API that removed the rtnl lock for some Qdiscs and
classifiers. To significantly improve the filter update rate and reduce
memory usage, we would like to allow users to skip percpu counter
allocation for a specific action if they do not expect a high traffic
rate hitting the action, which is a reasonable expectation for a
hardware-offloaded setup. In that case, any gains in software fast-path
performance from percpu-allocated counters, compared to regular integer
counters protected by a spinlock, are not important, but the additional
CPU time and memory they consume is significant.

In order to allow configuring the action counter allocation type at
runtime, implement the following changes:

- Implement helper functions to update the action counters and use them
  in affected actions instead of updating counters directly. This step
  abstracts the action implementations from the counter types that are
  used for a particular action instance at runtime.

- Modify the new helpers to use percpu counters if they were allocated
  during action initialization and use regular counters otherwise.

- Extend action UAPI TCA_ACT space with TCA_ACT_FLAGS field. Add
  TCA_ACT_FLAGS_NO_PERCPU_STATS action flag and update
  hardware-offloaded actions to not allocate percpu counters when the
  flag is set.

With these changes, users that prefer action update slow-path speed
over software fast-path speed can dynamically request actions to skip
percpu counter allocation without affecting other users.
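
The three changes listed above can be sketched in userspace C as
follows. This is a simplified model under stated assumptions, not the
kernel implementation: all demo_ names are hypothetical, a plain array
stands in for real percpu memory, DEMO_FLAG_NO_PERCPU_STATS stands in
for TCA_ACT_FLAGS_NO_PERCPU_STATS, and the spinlock that protects the
regular counters in the kernel is omitted.

```c
#include <stdint.h>
#include <stdlib.h>

#define DEMO_NR_CPUS 4			/* hypothetical CPU count */
#define DEMO_FLAG_NO_PERCPU_STATS 1u	/* stand-in for the netlink flag */

struct demo_stats {
	uint64_t packets;
	uint64_t bytes;
};

struct demo_action {
	struct demo_stats *percpu;	/* NULL when percpu counters were skipped */
	struct demo_stats shared;	/* regular counters (spinlock omitted) */
};

/* Allocate per-CPU counters unless the flag asks to skip them. */
static int demo_action_init(struct demo_action *a, unsigned int flags)
{
	a->shared.packets = 0;
	a->shared.bytes = 0;
	a->percpu = NULL;
	if (flags & DEMO_FLAG_NO_PERCPU_STATS)
		return 0;
	a->percpu = calloc(DEMO_NR_CPUS, sizeof(*a->percpu));
	return a->percpu ? 0 : -1;
}

/* Update helper: callers do not need to know which counter type is used. */
static void demo_action_update(struct demo_action *a, unsigned int cpu,
			       uint64_t bytes)
{
	struct demo_stats *s = a->percpu ? &a->percpu[cpu % DEMO_NR_CPUS]
					 : &a->shared;

	s->packets++;
	s->bytes += bytes;
}

/* Sum the shared counters and, if present, every per-CPU slot. */
static struct demo_stats demo_action_read(const struct demo_action *a)
{
	struct demo_stats total = a->shared;
	unsigned int i;

	for (i = 0; a->percpu && i < DEMO_NR_CPUS; i++) {
		total.packets += a->percpu[i].packets;
		total.bytes += a->percpu[i].bytes;
	}
	return total;
}

static void demo_action_destroy(struct demo_action *a)
{
	free(a->percpu);
	a->percpu = NULL;
}
```

Both variants report the same totals; the flag only changes where the
counts are stored and how much memory is allocated at init time.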

Now, let's look at the actual performance gains provided by this
change. A simple test is used to measure insertion rate: iproute2 TC is
executed in parallel by xargs in batch mode, and its total execution
time is measured by the shell builtin "time" command. The command runs
20 concurrent tc instances, each with its own batch file with 100k
rules:

$ time ls add* | xargs -n 1 -P 20 sudo tc -b

Two main rule profiles are tested. The first is a simple L2 flower
classifier with a single gact drop action. This configuration is chosen
as the worst case scenario because with single-action rules pressure on
the percpu allocator is minimized. Example rule:

filter add dev ens1f0 protocol ip ingress prio 1 handle 1 flower skip_hw
    src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 action drop

The second profile is a typical real-world scenario that uses the
flower classifier with some L2-4 fields and two actions
(tunnel_key+mirred). Example rule:

filter add dev ens1f0_0 protocol ip ingress prio 1 handle 1 flower
    skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 src_ip
    192.168.111.1 dst_ip 192.168.111.2 ip_proto udp dst_port 1 src_port
    1 action tunnel_key set id 1 src_ip 2.2.2.2 dst_ip 2.2.2.3 dst_port
    4789 action mirred egress redirect dev vxlan1

 Profile           |        percpu |     no_percpu | X improvement
                   | (k rules/sec) | (k rules/sec) |
-------------------+---------------+---------------+---------------
 Gact drop         |           203 |           259 |          1.28
 tunnel_key+mirred |            92 |           204 |          2.22

For the simple drop action, removing percpu allocation leads to a ~25%
insertion rate improvement. Perf profiles highlight the bottlenecks.

Perf profile of run with percpu allocation (gact drop):

+ 89.11% 0.48% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 88.58% 0.04% tc [kernel.vmlinux] [k] do_syscall_64
+ 87.50% 0.04% tc libc-2.29.so [.] __libc_sendmsg
+ 86.96% 0.04% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 86.85% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 86.60% 0.05% tc [kernel.vmlinux] [k] sock_sendmsg
+ 86.55% 0.12% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 86.04% 0.13% tc [kernel.vmlinux] [k] netlink_unicast
+ 85.42% 0.03% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 84.68% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 84.56% 0.24% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 75.73% 0.65% tc [cls_flower] [k] fl_change
+ 71.30% 0.03% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 71.27% 0.13% tc [kernel.vmlinux] [k] tcf_action_init
+ 71.06% 0.01% tc [kernel.vmlinux] [k] tcf_action_init_1
+ 70.41% 0.04% tc [act_gact] [k] tcf_gact_init
+ 53.59% 1.21% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 52.34% 0.34% tc [kernel.vmlinux] [k] tcf_idr_create
- 51.23% 2.17% tc [kernel.vmlinux] [k] pcpu_alloc
  - 49.05% pcpu_alloc
    + 39.35% __mutex_lock.isra.0
      4.99% memset_erms
    + 2.16% pcpu_alloc_area
  + 2.17% __libc_sendmsg
+ 45.89% 44.33% tc [kernel.vmlinux] [k] osq_lock
+ 9.94% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
+ 7.76% 0.00% tc [kernel.vmlinux] [k] tcf_idr_insert
+ 6.50% 0.03% tc [kernel.vmlinux] [k] tfilter_notify
+ 6.24% 6.11% tc [kernel.vmlinux] [k] mutex_spin_on_owner
+ 5.73% 5.32% tc [kernel.vmlinux] [k] memset_erms
+ 5.31% 0.18% tc [kernel.vmlinux] [k] tcf_fill_node

Here the bottleneck is clearly in the pcpu_alloc() function, which
takes more than half of the CPU time, most of it wasted busy-waiting
for the percpu allocator's internal global lock.

With percpu allocation removed (gact drop):

+ 87.50% 0.51% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 86.94% 0.07% tc [kernel.vmlinux] [k] do_syscall_64
+ 85.75% 0.04% tc libc-2.29.so [.] __libc_sendmsg
+ 85.00% 0.07% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 84.84% 0.07% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 84.59% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg
+ 84.58% 0.14% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 83.95% 0.12% tc [kernel.vmlinux] [k] netlink_unicast
+ 83.34% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 82.39% 0.12% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 82.16% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 75.13% 0.84% tc [cls_flower] [k] fl_change
+ 69.92% 0.05% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 69.87% 0.11% tc [kernel.vmlinux] [k] tcf_action_init
+ 69.61% 0.02% tc [kernel.vmlinux] [k] tcf_action_init_1
- 68.80% 0.10% tc [act_gact] [k] tcf_gact_init
  - 68.70% tcf_gact_init
    + 36.08% tcf_idr_check_alloc
    + 31.88% tcf_idr_insert
+ 63.72% 0.58% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 58.80% 56.68% tc [kernel.vmlinux] [k] osq_lock
+ 36.08% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
+ 31.88% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert

The gact actions (like all other action types) are inserted in a single
idr instance protected by a global (per namespace) lock. That lock
becomes the new bottleneck with such a simple rule profile and prevents
achieving the 2x+ performance increase that could be expected from the
profiling data for action insertion with percpu counters.

Perf profile of run with percpu allocation (tunnel_key+mirred):

+ 91.95% 0.21% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 91.74% 0.06% tc [kernel.vmlinux] [k] do_syscall_64
+ 90.74% 0.01% tc libc-2.29.so [.] __libc_sendmsg
+ 90.52% 0.01% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 90.50% 0.04% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 90.41% 0.02% tc [kernel.vmlinux] [k] sock_sendmsg
+ 90.38% 0.04% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 90.10% 0.06% tc [kernel.vmlinux] [k] netlink_unicast
+ 89.76% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 89.28% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 89.15% 0.03% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 83.41% 0.33% tc [cls_flower] [k] fl_change
+ 81.17% 0.04% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 81.13% 0.06% tc [kernel.vmlinux] [k] tcf_action_init
+ 81.04% 0.04% tc [kernel.vmlinux] [k] tcf_action_init_1
- 73.59% 2.16% tc [kernel.vmlinux] [k] pcpu_alloc
  - 71.42% pcpu_alloc
    + 61.41% __mutex_lock.isra.0
      5.02% memset_erms
    + 2.93% pcpu_alloc_area
  + 2.16% __libc_sendmsg
+ 63.58% 0.17% tc [kernel.vmlinux] [k] tcf_idr_create
+ 63.40% 0.60% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 57.85% 56.38% tc [kernel.vmlinux] [k] osq_lock
+ 46.27% 0.13% tc [act_tunnel_key] [k] tunnel_key_init
+ 34.26% 0.02% tc [act_mirred] [k] tcf_mirred_init
+ 10.99% 0.00% tc [kernel.vmlinux] [k] dst_cache_init
+ 5.32% 5.11% tc [kernel.vmlinux] [k] memset_erms

With twice as many actions, pressure on the percpu allocator doubles,
so it now takes ~74% of CPU execution time.

With percpu allocation removed (tunnel_key+mirred):

+ 86.02% 0.50% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 85.51% 0.12% tc [kernel.vmlinux] [k] do_syscall_64
+ 84.40% 0.03% tc libc-2.29.so [.] __libc_sendmsg
+ 83.84% 0.03% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 83.72% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 83.56% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg
+ 83.50% 0.08% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 83.02% 0.17% tc [kernel.vmlinux] [k] netlink_unicast
+ 82.48% 0.00% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 81.89% 0.11% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 81.71% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 73.99% 0.63% tc [cls_flower] [k] fl_change
+ 69.72% 0.00% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 69.72% 0.09% tc [kernel.vmlinux] [k] tcf_action_init
+ 69.53% 0.05% tc [kernel.vmlinux] [k] tcf_action_init_1
+ 53.08% 0.91% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 45.52% 43.99% tc [kernel.vmlinux] [k] osq_lock
- 36.02% 0.21% tc [act_tunnel_key] [k] tunnel_key_init
  - 35.81% tunnel_key_init
    + 15.95% tcf_idr_check_alloc
    + 13.91% tcf_idr_insert
    - 4.70% dst_cache_init
      + 4.68% pcpu_alloc
+ 33.22% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
+ 32.34% 0.05% tc [act_mirred] [k] tcf_mirred_init
+ 28.24% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert
+ 7.79% 0.05% tc [kernel.vmlinux] [k] idr_alloc_u32
+ 7.67% 7.35% tc [kernel.vmlinux] [k] idr_get_free
+ 6.46% 6.22% tc [kernel.vmlinux] [k] mutex_spin_on_owner
+ 5.11% 0.05% tc [kernel.vmlinux] [k] tfilter_notify

With percpu allocation removed, the insertion rate increases by ~120%.
This rule profile scales much better than the simple single-action one
because both types of actions were competing for the single lock in the
percpu allocator, but not for the action idr lock, which is per-action.
Note that the percpu allocator is still used by dst_cache in tunnel_key
actions and consumes 4.68% of CPU time. Dst_cache seems like a good
opportunity for further insertion rate optimization but is not
addressed by this change.

Another improvement provided by this change is significantly reduced
memory usage. It is measured by sampling the "used memory" value from
the "vmstat -s" command output. The following table shows memory usage
measurements for the same two configurations that were used for
measuring insertion rate:

 Profile           | Mem per rule | Mem per rule no_percpu | Less memory used
                   |         (KB) |                   (KB) |             (KB)
-------------------+--------------+------------------------+------------------
 Gact drop         |         3.91 |                   2.51 |              1.4
 tunnel_key+mirred |         6.73 |                   3.91 |              2.8

Results indicate that the memory usage of the percpu allocator per
action is ~1.4 KB. Note that any measurement of percpu allocator memory
usage is inherently tied to the particular setup, since memory usage is
linear in the number of cores in the system. It is to be expected that
on current top of the line servers percpu allocator memory usage will
be 2-5x higher than on the 24-CPU setup that was used for testing.

Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory

Patches applied on top of net-next branch:

commit 2203cbf2 (net-next)
Author: Russell King <rmk+kernel@armlinux.org.uk>
Date: Tue Oct 15 11:38:39 2019 +0100

net: sfp: move fwnode parsing into sfp-bus layer

Changes V1 -> V2:

- Include memory measurements.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
