- 31 Oct, 2019 40 commits
-
-
Madalin Bucur authored
Remove cast, align variable name, simplify DMA map size computation. Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Madalin Bucur authored
Instead of reading skb fields, use information from the DPAA frame descriptor. Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Madalin Bucur authored
Avoid casts and repeated conversions. Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Madalin Bucur authored
The dpaa_cleanup_tx_fd() function is called by the frame transmit confirmation callback but also on several error paths. This function is reading the transmit timestamp value. Avoid reading an invalid timestamp value on the error paths. Fixes: 4664856e ("dpaa_eth: add support for hardware timestamping") Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Madalin Bucur authored
DMA unmapping is required before accessing the HW provided timestamping information. Fixes: 4664856e ("dpaa_eth: add support for hardware timestamping") Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Madalin Bucur authored
Change the buffers used for reception from netdev_frags to pages. Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Madalin Bucur authored
Currently the DPAA Ethernet driver is using three buffer pools for each interface, with three different sizes for the buffers provided for the FMan reception path. This patch reduces the number of buffer pools to one per interface. This change is in preparation of another, that will be switching from netdev_frags to page backed buffers for the receive path. Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Huazhong Tan says: ==================== net: hns3: add some optimizations and cleanups This series adds some code optimizations and cleanups for the HNS3 ethernet driver. [patch 1/9] dumps some debug information when reset fail. [patch 2/9] dumps some struct netdev_queue information when TX timeout. [patch 3/9] cleanups some magic numbers. [patch 4/9] cleanups some coding style issue. [patch 5/9] fixes a compiler warning. [patch 6/9] optimizes some local variable initialization. [patch 7/9] modifies some comments. [patch 8/9] cleanups some print format warnings. [patch 9/9] cleanups byte order issue. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guojia Liao authored
Though the hip08 and the IMP(Intelligent Management Processor) have the same byte order right now, it is better to convert __be or __le variable into the CPU's byte order before print. Signed-off-by: Guojia Liao <liaoguojia@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guojia Liao authored
Using '%d' for printing type unsigned int or '%u' for type int would cause static tools to give false warnings, so this patch cleanups this warning by using the suitable format specifier of the type of variable. BTW, modifies the type of some variables and macro to synchronize with their usage. Signed-off-by: Guojia Liao <liaoguojia@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guangbin Huang authored
This patch makes the comment for macro HCLGE_MBX_GET_VF_FLR_STATUS more correct, and adds comments in some place to make the code more readable. Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guangbin Huang authored
The variable tx_ring is unnecessary to be initialized as it will be set before used, and the variable rst_cnt is better to be initialized when declaration for simplification. Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guojia Liao authored
In hns3_nic_init_irq(), when '*_int_idx' has more than 9 digits and the length of netdev's name is IFNAMSIZ, the total length of final name will be bigger the HNAE3_INT_NAME_LEN - 1, even though '*_int_idx' will never have such large value, but the compiler gives a format-truncation warning for this case. So this patch just enlarges the length to avoid this warning. Signed-off-by: Guojia Liao <liaoguojia@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guangbin Huang authored
To unify code style and make code simpler, this patch modifies some code, deletes unnecessary blank lines and {}, changes location of code, and so on. No functional change. Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guojia Liao authored
To make the code more readable, this patch replaces some magic numbers with macro or sizeof operation. Also uses macro lower_32_bits and upper_32_bits to get bits 0-31 and 32-63 of a number, instead of using type conversion and '>>' operation. No functional change. Signed-off-by: Guojia Liao <liaoguojia@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yunsheng Lin authored
When there is a TX timeout, we can tell if the driver or stack has stopped the queue by looking at state field, and when has the last packet transmited by looking at trans_start field. So this patch prints these two field in the hns3_get_tx_timeo_queue_info(). Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Huazhong Tan authored
When reset fails, there is some information that will help for finding out why does reset fail. and removes an unused core_rst_cnt field in struct hclge_rst_stats. Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Sheetal Tigadoli says: ==================== bnxt_en: Add OP-TEE based bnxt f/w manager This patch series adds support for TEE based BNXT firmware management module and the driver changes to invoke OP-TEE APIs to fastboot firmware and to collect crash dump. Changes from v4: - update Kconfig to reflect dependency on TEE driver ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vasundhara Volam authored
Driver supports 2 types of core dumps. 1. Live dump - Firmware dump when system is up and running. 2. Crash dump - Dump which is collected during firmware crash that can be retrieved after recovery. Crash dump is currently supported only on specific 58800 chips which can be retrieved using OP-TEE API only, as firmware cannot access this region directly. User needs to set the dump flag using following command before initiating the dump collection: $ ethtool -W|--set-dump eth0 N Where N is "0" for live dump and "1" for crash dump Command to collect the dump after setting the flag: $ ethtool -w eth0 data Filename v3: Modify set_dump to support even when CONFIG_TEE_BNXT_FW=n. Also change log message to netdev_info(). Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Sheetal Tigadoli <sheetal.tigadoli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vasundhara Volam authored
In error recovery process when firmware indicates that it is completely down, initiate a firmware reset by calling OP-TEE API. Cc: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Sheetal Tigadoli <sheetal.tigadoli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vikas Gupta authored
This driver registers on TEE bus to interact with OP-TEE based BNXT firmware management modules Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Signed-off-by: Sheetal Tigadoli <sheetal.tigadoli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Ido Schimmel says: ==================== mlxsw: Make port split code more generic Jiri says: Currently, we assume some limitations and constant values which are not applicable for Spectrum-3 which has 8 lanes ports (instead of previous 4 lanes). This patch does 2 things: 1) Generalizes the code to not use constants so it can work for 4, 8 and possibly 16 lanes. 2) Enforces some assumptions we had in the code but did not check. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Make the check generic for any possible value, not only 2 and 4. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
During recreation of original unsplit ports, just simply iterate over the whole gap and recreate whatever originally existed. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
The current code considers only split by 2 or 4. Make the base port getting generic and allow split by 8 to be handled correctly. Generalize the used port checks as well. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Instead of using constant value, use port_module_max_width which is aligned with the cluster size. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Don't compute the original base local port during unsplit, rather remember it in mlxsw_sp_port structure during split port creation. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
In Spectrum-3 the modules have 8 lanes, so split by count 2 results in two split ports each of 4 lanes. Add a resource that can be used to obtain local port offset in that case. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Get local port offsets of split port in a separate helper function and use it in both split and unsplit function. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Driver assumes certain values in the PMLP register. Add checks that verify that PMLP register provides fitting values. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Pass the port mapping structure down to create, module_map and other function instead of individual values. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Don't use constant max width value and instead of that, use the actual width of the port. Also don't pass module value and use the value stored in the same structure. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Store the initial PMLP register configuration into array of structures instead of just simple array of module numbers. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Currently when user does split, he is not able to distinguish if the port cannot be split because it is already split, or because it cannot be split at all. Add another check for split flag to distinguish this. Also add check forbidding split when maximal width is 1. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
The fact that the port cannot be split further should be checked before checking the count, so move it. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Currently the max module width is hard-coded according to ASIC type. That is not entirely correct, as the max module width might differ per-board. Use PMTM register to query FW for maximal width of a module. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
The PMTM allows query or configuration of module types. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
The tx/rx lane fields got extended to 4 bits, update the reg field description accordingly. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christophe JAILLET authored
Use '__skb_queue_purge()' instead of re-implementing it. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Vlad Buslov says: ==================== Control action percpu counters allocation by netlink flag Currently, significant fraction of CPU time during TC filter allocation is spent in percpu allocator. Moreover, percpu allocator is protected with single global mutex which negates any potential to improve its performance by means of recent developments in TC filter update API that removed rtnl lock for some Qdiscs and classifiers. In order to significantly improve filter update rate and reduce memory usage we would like to allow users to skip percpu counters allocation for specific action if they don't expect high traffic rate hitting the action, which is a reasonable expectation for hardware-offloaded setup. In that case any potential gains to software fast-path performance gained by usage of percpu-allocated counters compared to regular integer counters protected by spinlock are not important, but amount of additional CPU and memory consumed by them is significant. In order to allow configuring action counters allocation type at runtime, implement following changes: - Implement helper functions to update the action counters and use them in affected actions instead of updating counters directly. This steps abstracts actions implementation from counter types that are being used for particular action instance at runtime. - Modify the new helpers to use percpu counters if they were allocated during action initialization and use regular counters otherwise. - Extend action UAPI TCA_ACT space with TCA_ACT_FLAGS field. Add TCA_ACT_FLAGS_NO_PERCPU_STATS action flag and update hardware-offloaded actions to not allocate percpu counters when the flag is set. With this changes users that prefer action update slow-path speed over software fast-path speed can dynamically request actions to skip percpu counters allocation without affecting other users. Now, lets look at actual performance gains provided by this change. Simple test is used to measure insertion rate - iproute2 TC is executed in parallel by xargs in batch mode, its total execution time is measured by shell builtin "time" command. The command runs 20 concurrent tc instances, each with its own batch file with 100k rules: $ time ls add* | xargs -n 1 -P 20 sudo tc -b Two main rule profiles are tested. First is simple L2 flower classifier with single gact drop action. The configuration is chosen as worst case scenario because with single-action rules pressure on percpu allocator is minimized. Example rule: filter add dev ens1f0 protocol ip ingress prio 1 handle 1 flower skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 action drop Second profile is typical real-world scenario that uses flower classifier with some L2-4 fields and two actions (tunnel_key+mirred). Example rule: filter add dev ens1f0_0 protocol ip ingress prio 1 handle 1 flower skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 src_ip 192.168.111.1 dst_ip 192.168.111.2 ip_proto udp dst_port 1 src_port 1 action tunnel_key set id 1 src_ip 2.2.2.2 dst_ip 2.2.2.3 dst_port 4789 action mirred egress redirect dev vxlan1 Profile | percpu | no_percpu | X improvement | (k rules/sec) | (k rules/sec) | -------------------+---------------+---------------+--------------- Gact drop | 203 | 259 | 1.28 tunnel_key+mirred | 92 | 204 | 2.22 For simple drop action removing percpu allocation leads to ~25% insertion rate improvement. Perf profiles highlights the bottlenecks. Perf profile of run with percpu allocation (gact drop): + 89.11% 0.48% tc [kernel.vmlinux] [k] entry_SYSCALL_64 + 88.58% 0.04% tc [kernel.vmlinux] [k] do_syscall_64 + 87.50% 0.04% tc libc-2.29.so [.] __libc_sendmsg + 86.96% 0.04% tc [kernel.vmlinux] [k] __sys_sendmsg + 86.85% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg + 86.60% 0.05% tc [kernel.vmlinux] [k] sock_sendmsg + 86.55% 0.12% tc [kernel.vmlinux] [k] netlink_sendmsg + 86.04% 0.13% tc [kernel.vmlinux] [k] netlink_unicast + 85.42% 0.03% tc [kernel.vmlinux] [k] netlink_rcv_skb + 84.68% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg + 84.56% 0.24% tc [kernel.vmlinux] [k] tc_new_tfilter + 75.73% 0.65% tc [cls_flower] [k] fl_change + 71.30% 0.03% tc [kernel.vmlinux] [k] tcf_exts_validate + 71.27% 0.13% tc [kernel.vmlinux] [k] tcf_action_init + 71.06% 0.01% tc [kernel.vmlinux] [k] tcf_action_init_1 + 70.41% 0.04% tc [act_gact] [k] tcf_gact_init + 53.59% 1.21% tc [kernel.vmlinux] [k] __mutex_lock.isra.0 + 52.34% 0.34% tc [kernel.vmlinux] [k] tcf_idr_create - 51.23% 2.17% tc [kernel.vmlinux] [k] pcpu_alloc - 49.05% pcpu_alloc + 39.35% __mutex_lock.isra.0 4.99% memset_erms + 2.16% pcpu_alloc_area + 2.17% __libc_sendmsg + 45.89% 44.33% tc [kernel.vmlinux] [k] osq_lock + 9.94% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc + 7.76% 0.00% tc [kernel.vmlinux] [k] tcf_idr_insert + 6.50% 0.03% tc [kernel.vmlinux] [k] tfilter_notify + 6.24% 6.11% tc [kernel.vmlinux] [k] mutex_spin_on_owner + 5.73% 5.32% tc [kernel.vmlinux] [k] memset_erms + 5.31% 0.18% tc [kernel.vmlinux] [k] tcf_fill_node Here bottleneck is clearly in pcpu_alloc() function that takes more than half CPU time, which is mostly wasted busy-waiting for internal percpu allocator global lock. With percpu allocation removed (gact drop): + 87.50% 0.51% tc [kernel.vmlinux] [k] entry_SYSCALL_64 + 86.94% 0.07% tc [kernel.vmlinux] [k] do_syscall_64 + 85.75% 0.04% tc libc-2.29.so [.] __libc_sendmsg + 85.00% 0.07% tc [kernel.vmlinux] [k] __sys_sendmsg + 84.84% 0.07% tc [kernel.vmlinux] [k] ___sys_sendmsg + 84.59% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg + 84.58% 0.14% tc [kernel.vmlinux] [k] netlink_sendmsg + 83.95% 0.12% tc [kernel.vmlinux] [k] netlink_unicast + 83.34% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb + 82.39% 0.12% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg + 82.16% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter + 75.13% 0.84% tc [cls_flower] [k] fl_change + 69.92% 0.05% tc [kernel.vmlinux] [k] tcf_exts_validate + 69.87% 0.11% tc [kernel.vmlinux] [k] tcf_action_init + 69.61% 0.02% tc [kernel.vmlinux] [k] tcf_action_init_1 - 68.80% 0.10% tc [act_gact] [k] tcf_gact_init - 68.70% tcf_gact_init + 36.08% tcf_idr_check_alloc + 31.88% tcf_idr_insert + 63.72% 0.58% tc [kernel.vmlinux] [k] __mutex_lock.isra.0 + 58.80% 56.68% tc [kernel.vmlinux] [k] osq_lock + 36.08% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc + 31.88% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert The gact actions (like all other actions types) are inserted in single idr instance protected by global (per namespace) lock that becomes new bottleneck with such simple rule profile and prevents achieving 2x+ performance increase that can be expected by looking at profiling data for insertion action with percpu counter. Perf profile of run with percpu allocation (tunnel_key+mirred): + 91.95% 0.21% tc [kernel.vmlinux] [k] entry_SYSCALL_64 + 91.74% 0.06% tc [kernel.vmlinux] [k] do_syscall_64 + 90.74% 0.01% tc libc-2.29.so [.] __libc_sendmsg + 90.52% 0.01% tc [kernel.vmlinux] [k] __sys_sendmsg + 90.50% 0.04% tc [kernel.vmlinux] [k] ___sys_sendmsg + 90.41% 0.02% tc [kernel.vmlinux] [k] sock_sendmsg + 90.38% 0.04% tc [kernel.vmlinux] [k] netlink_sendmsg + 90.10% 0.06% tc [kernel.vmlinux] [k] netlink_unicast + 89.76% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb + 89.28% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg + 89.15% 0.03% tc [kernel.vmlinux] [k] tc_new_tfilter + 83.41% 0.33% tc [cls_flower] [k] fl_change + 81.17% 0.04% tc [kernel.vmlinux] [k] tcf_exts_validate + 81.13% 0.06% tc [kernel.vmlinux] [k] tcf_action_init + 81.04% 0.04% tc [kernel.vmlinux] [k] tcf_action_init_1 - 73.59% 2.16% tc [kernel.vmlinux] [k] pcpu_alloc - 71.42% pcpu_alloc + 61.41% __mutex_lock.isra.0 5.02% memset_erms + 2.93% pcpu_alloc_area + 2.16% __libc_sendmsg + 63.58% 0.17% tc [kernel.vmlinux] [k] tcf_idr_create + 63.40% 0.60% tc [kernel.vmlinux] [k] __mutex_lock.isra.0 + 57.85% 56.38% tc [kernel.vmlinux] [k] osq_lock + 46.27% 0.13% tc [act_tunnel_key] [k] tunnel_key_init + 34.26% 0.02% tc [act_mirred] [k] tcf_mirred_init + 10.99% 0.00% tc [kernel.vmlinux] [k] dst_cache_init + 5.32% 5.11% tc [kernel.vmlinux] [k] memset_erms With two times more actions pressure on percpu allocator doubles, so now it takes ~74% of CPU execution time. With percpu allocation removed (tunnel_key+mirred): + 86.02% 0.50% tc [kernel.vmlinux] [k] entry_SYSCALL_64 + 85.51% 0.12% tc [kernel.vmlinux] [k] do_syscall_64 + 84.40% 0.03% tc libc-2.29.so [.] __libc_sendmsg + 83.84% 0.03% tc [kernel.vmlinux] [k] __sys_sendmsg + 83.72% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg + 83.56% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg + 83.50% 0.08% tc [kernel.vmlinux] [k] netlink_sendmsg + 83.02% 0.17% tc [kernel.vmlinux] [k] netlink_unicast + 82.48% 0.00% tc [kernel.vmlinux] [k] netlink_rcv_skb + 81.89% 0.11% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg + 81.71% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter + 73.99% 0.63% tc [cls_flower] [k] fl_change + 69.72% 0.00% tc [kernel.vmlinux] [k] tcf_exts_validate + 69.72% 0.09% tc [kernel.vmlinux] [k] tcf_action_init + 69.53% 0.05% tc [kernel.vmlinux] [k] tcf_action_init_1 + 53.08% 0.91% tc [kernel.vmlinux] [k] __mutex_lock.isra.0 + 45.52% 43.99% tc [kernel.vmlinux] [k] osq_lock - 36.02% 0.21% tc [act_tunnel_key] [k] tunnel_key_init - 35.81% tunnel_key_init + 15.95% tcf_idr_check_alloc + 13.91% tcf_idr_insert - 4.70% dst_cache_init + 4.68% pcpu_alloc + 33.22% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc + 32.34% 0.05% tc [act_mirred] [k] tcf_mirred_init + 28.24% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert + 7.79% 0.05% tc [kernel.vmlinux] [k] idr_alloc_u32 + 7.67% 7.35% tc [kernel.vmlinux] [k] idr_get_free + 6.46% 6.22% tc [kernel.vmlinux] [k] mutex_spin_on_owner + 5.11% 0.05% tc [kernel.vmlinux] [k] tfilter_notify With percpu allocation removed insertion rate is increased by ~120%. Such rule profile scales much better than simple single action because both types of actions were competing for single lock in percpu allocator, but not for action idr lock, which is per-action. Note that percpu allocator is still used by dst_cache in tunnel_key actions and consumes 4.68% CPU time. Dst_cache seems like good opportunity for further insertion rate optimization but is not addressed by this change. Another improvement provided by this change is significantly reduced memory usage. The test is implemented by sampling "used memory" value from "vmstat -s" command output. Following table includes memory usage measurements for same two configurations that were used for measuring insertion rate: Profile | Mem per rule | Mem per rule no_percpu | Less memory used | (KB) | (KB) | (KB) -------------------+--------------+------------------------+------------------ Gact drop | 3.91 | 2.51 | 1.4 tunnel_key+mirred | 6.73 | 3.91 | 2.8 Results indicate that memory usage of percpu allocator per action is ~1.4 KB. Note that any measurements of percpu allocator memory usage is inherently tied to particular setup since memory usage is linear to number of cores in system. It is to be expected that on current top of the line servers percpu allocator memory usage will be 2-5x more than on 24 CPUs setup that was used for testing. Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory Patches applied on top of net-next branch: commit 2203cbf2 (net-next) Author: Russell King <rmk+kernel@armlinux.org.uk> Date: Tue Oct 15 11:38:39 2019 +0100 net: sfp: move fwnode parsing into sfp-bus layer Changes V1 -> V2: - Include memory measurements. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-