Merge branch 'Control-action-percpu-counters-allocation-by-netlink-flag'

Vlad Buslov says: ==================== Control action percpu counters allocation by netlink flag Currently, significant fraction of CPU time during TC filter allocation is spent in percpu allocator. Moreover, percpu allocator is protected with single global mutex which negates any potential to improve its performance by means of recent developments in TC filter update API that removed rtnl lock for some Qdiscs and classifiers. In order to significantly improve filter update rate and reduce memory usage we would like to allow users to skip percpu counters allocation for specific action if they don't expect high traffic rate hitting the action, which is a reasonable expectation for hardware-offloaded setup. In that case any potential gains to software fast-path performance gained by usage of percpu-allocated counters compared to regular integer counters protected by spinlock are not important, but amount of additional CPU and memory consumed by them is significant. In order to allow configuring action counters allocation type at runtime, implement following changes: - Implement helper functions to update the action counters and use them in affected actions instead of updating counters directly. This steps abstracts actions implementation from counter types that are being used for particular action instance at runtime. - Modify the new helpers to use percpu counters if they were allocated during action initialization and use regular counters otherwise. - Extend action UAPI TCA_ACT space with TCA_ACT_FLAGS field. Add TCA_ACT_FLAGS_NO_PERCPU_STATS action flag and update hardware-offloaded actions to not allocate percpu counters when the flag is set. With this changes users that prefer action update slow-path speed over software fast-path speed can dynamically request actions to skip percpu counters allocation without affecting other users. Now, lets look at actual performance gains provided by this change. Simple test is used to measure insertion rate - iproute2 TC is executed in parallel by xargs in batch mode, its total execution time is measured by shell builtin "time" command. The command runs 20 concurrent tc instances, each with its own batch file with 100k rules: $ time ls add* | xargs -n 1 -P 20 sudo tc -b Two main rule profiles are tested. First is simple L2 flower classifier with single gact drop action. The configuration is chosen as worst case scenario because with single-action rules pressure on percpu allocator is minimized. Example rule: filter add dev ens1f0 protocol ip ingress prio 1 handle 1 flower skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 action drop Second profile is typical real-world scenario that uses flower classifier with some L2-4 fields and two actions (tunnel_key+mirred). Example rule: filter add dev ens1f0_0 protocol ip ingress prio 1 handle 1 flower skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 src_ip 192.168.111.1 dst_ip 192.168.111.2 ip_proto udp dst_port 1 src_port 1 action tunnel_key set id 1 src_ip 2.2.2.2 dst_ip 2.2.2.3 dst_port 4789 action mirred egress redirect dev vxlan1 Profile | percpu | no_percpu | X improvement | (k rules/sec) | (k rules/sec) | -------------------+---------------+---------------+--------------- Gact drop | 203 | 259 | 1.28 tunnel_key+mirred | 92 | 204 | 2.22 For simple drop action removing percpu allocation leads to ~25% insertion rate improvement. Perf profiles highlights the bottlenecks. Perf profile of run with percpu allocation (gact drop): + 89.11% 0.48% tc [kernel.vmlinux] [k] entry_SYSCALL_64 + 88.58% 0.04% tc [kernel.vmlinux] [k] do_syscall_64 + 87.50% 0.04% tc libc-2.29.so [.] __libc_sendmsg + 86.96% 0.04% tc [kernel.vmlinux] [k] __sys_sendmsg + 86.85% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg + 86.60% 0.05% tc [kernel.vmlinux] [k] sock_sendmsg + 86.55% 0.12% tc [kernel.vmlinux] [k] netlink_sendmsg + 86.04% 0.13% tc [kernel.vmlinux] [k] netlink_unicast + 85.42% 0.03% tc [kernel.vmlinux] [k] netlink_rcv_skb + 84.68% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg + 84.56% 0.24% tc [kernel.vmlinux] [k] tc_new_tfilter + 75.73% 0.65% tc [cls_flower] [k] fl_change + 71.30% 0.03% tc [kernel.vmlinux] [k] tcf_exts_validate + 71.27% 0.13% tc [kernel.vmlinux] [k] tcf_action_init + 71.06% 0.01% tc [kernel.vmlinux] [k] tcf_action_init_1 + 70.41% 0.04% tc [act_gact] [k] tcf_gact_init + 53.59% 1.21% tc [kernel.vmlinux] [k] __mutex_lock.isra.0 + 52.34% 0.34% tc [kernel.vmlinux] [k] tcf_idr_create - 51.23% 2.17% tc [kernel.vmlinux] [k] pcpu_alloc - 49.05% pcpu_alloc + 39.35% __mutex_lock.isra.0 4.99% memset_erms + 2.16% pcpu_alloc_area + 2.17% __libc_sendmsg + 45.89% 44.33% tc [kernel.vmlinux] [k] osq_lock + 9.94% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc + 7.76% 0.00% tc [kernel.vmlinux] [k] tcf_idr_insert + 6.50% 0.03% tc [kernel.vmlinux] [k] tfilter_notify + 6.24% 6.11% tc [kernel.vmlinux] [k] mutex_spin_on_owner + 5.73% 5.32% tc [kernel.vmlinux] [k] memset_erms + 5.31% 0.18% tc [kernel.vmlinux] [k] tcf_fill_node Here bottleneck is clearly in pcpu_alloc() function that takes more than half CPU time, which is mostly wasted busy-waiting for internal percpu allocator global lock. With percpu allocation removed (gact drop): + 87.50% 0.51% tc [kernel.vmlinux] [k] entry_SYSCALL_64 + 86.94% 0.07% tc [kernel.vmlinux] [k] do_syscall_64 + 85.75% 0.04% tc libc-2.29.so [.] __libc_sendmsg + 85.00% 0.07% tc [kernel.vmlinux] [k] __sys_sendmsg + 84.84% 0.07% tc [kernel.vmlinux] [k] ___sys_sendmsg + 84.59% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg + 84.58% 0.14% tc [kernel.vmlinux] [k] netlink_sendmsg + 83.95% 0.12% tc [kernel.vmlinux] [k] netlink_unicast + 83.34% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb + 82.39% 0.12% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg + 82.16% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter + 75.13% 0.84% tc [cls_flower] [k] fl_change + 69.92% 0.05% tc [kernel.vmlinux] [k] tcf_exts_validate + 69.87% 0.11% tc [kernel.vmlinux] [k] tcf_action_init + 69.61% 0.02% tc [kernel.vmlinux] [k] tcf_action_init_1 - 68.80% 0.10% tc [act_gact] [k] tcf_gact_init - 68.70% tcf_gact_init + 36.08% tcf_idr_check_alloc + 31.88% tcf_idr_insert + 63.72% 0.58% tc [kernel.vmlinux] [k] __mutex_lock.isra.0 + 58.80% 56.68% tc [kernel.vmlinux] [k] osq_lock + 36.08% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc + 31.88% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert The gact actions (like all other actions types) are inserted in single idr instance protected by global (per namespace) lock that becomes new bottleneck with such simple rule profile and prevents achieving 2x+ performance increase that can be expected by looking at profiling data for insertion action with percpu counter. Perf profile of run with percpu allocation (tunnel_key+mirred): + 91.95% 0.21% tc [kernel.vmlinux] [k] entry_SYSCALL_64 + 91.74% 0.06% tc [kernel.vmlinux] [k] do_syscall_64 + 90.74% 0.01% tc libc-2.29.so [.] __libc_sendmsg + 90.52% 0.01% tc [kernel.vmlinux] [k] __sys_sendmsg + 90.50% 0.04% tc [kernel.vmlinux] [k] ___sys_sendmsg + 90.41% 0.02% tc [kernel.vmlinux] [k] sock_sendmsg + 90.38% 0.04% tc [kernel.vmlinux] [k] netlink_sendmsg + 90.10% 0.06% tc [kernel.vmlinux] [k] netlink_unicast + 89.76% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb + 89.28% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg + 89.15% 0.03% tc [kernel.vmlinux] [k] tc_new_tfilter + 83.41% 0.33% tc [cls_flower] [k] fl_change + 81.17% 0.04% tc [kernel.vmlinux] [k] tcf_exts_validate + 81.13% 0.06% tc [kernel.vmlinux] [k] tcf_action_init + 81.04% 0.04% tc [kernel.vmlinux] [k] tcf_action_init_1 - 73.59% 2.16% tc [kernel.vmlinux] [k] pcpu_alloc - 71.42% pcpu_alloc + 61.41% __mutex_lock.isra.0 5.02% memset_erms + 2.93% pcpu_alloc_area + 2.16% __libc_sendmsg + 63.58% 0.17% tc [kernel.vmlinux] [k] tcf_idr_create + 63.40% 0.60% tc [kernel.vmlinux] [k] __mutex_lock.isra.0 + 57.85% 56.38% tc [kernel.vmlinux] [k] osq_lock + 46.27% 0.13% tc [act_tunnel_key] [k] tunnel_key_init + 34.26% 0.02% tc [act_mirred] [k] tcf_mirred_init + 10.99% 0.00% tc [kernel.vmlinux] [k] dst_cache_init + 5.32% 5.11% tc [kernel.vmlinux] [k] memset_erms With two times more actions pressure on percpu allocator doubles, so now it takes ~74% of CPU execution time. With percpu allocation removed (tunnel_key+mirred): + 86.02% 0.50% tc [kernel.vmlinux] [k] entry_SYSCALL_64 + 85.51% 0.12% tc [kernel.vmlinux] [k] do_syscall_64 + 84.40% 0.03% tc libc-2.29.so [.] __libc_sendmsg + 83.84% 0.03% tc [kernel.vmlinux] [k] __sys_sendmsg + 83.72% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg + 83.56% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg + 83.50% 0.08% tc [kernel.vmlinux] [k] netlink_sendmsg + 83.02% 0.17% tc [kernel.vmlinux] [k] netlink_unicast + 82.48% 0.00% tc [kernel.vmlinux] [k] netlink_rcv_skb + 81.89% 0.11% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg + 81.71% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter + 73.99% 0.63% tc [cls_flower] [k] fl_change + 69.72% 0.00% tc [kernel.vmlinux] [k] tcf_exts_validate + 69.72% 0.09% tc [kernel.vmlinux] [k] tcf_action_init + 69.53% 0.05% tc [kernel.vmlinux] [k] tcf_action_init_1 + 53.08% 0.91% tc [kernel.vmlinux] [k] __mutex_lock.isra.0 + 45.52% 43.99% tc [kernel.vmlinux] [k] osq_lock - 36.02% 0.21% tc [act_tunnel_key] [k] tunnel_key_init - 35.81% tunnel_key_init + 15.95% tcf_idr_check_alloc + 13.91% tcf_idr_insert - 4.70% dst_cache_init + 4.68% pcpu_alloc + 33.22% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc + 32.34% 0.05% tc [act_mirred] [k] tcf_mirred_init + 28.24% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert + 7.79% 0.05% tc [kernel.vmlinux] [k] idr_alloc_u32 + 7.67% 7.35% tc [kernel.vmlinux] [k] idr_get_free + 6.46% 6.22% tc [kernel.vmlinux] [k] mutex_spin_on_owner + 5.11% 0.05% tc [kernel.vmlinux] [k] tfilter_notify With percpu allocation removed insertion rate is increased by ~120%. Such rule profile scales much better than simple single action because both types of actions were competing for single lock in percpu allocator, but not for action idr lock, which is per-action. Note that percpu allocator is still used by dst_cache in tunnel_key actions and consumes 4.68% CPU time. Dst_cache seems like good opportunity for further insertion rate optimization but is not addressed by this change. Another improvement provided by this change is significantly reduced memory usage. The test is implemented by sampling "used memory" value from "vmstat -s" command output. Following table includes memory usage measurements for same two configurations that were used for measuring insertion rate: Profile | Mem per rule | Mem per rule no_percpu | Less memory used | (KB) | (KB) | (KB) -------------------+--------------+------------------------+------------------ Gact drop | 3.91 | 2.51 | 1.4 tunnel_key+mirred | 6.73 | 3.91 | 2.8 Results indicate that memory usage of percpu allocator per action is ~1.4 KB. Note that any measurements of percpu allocator memory usage is inherently tied to particular setup since memory usage is linear to number of cores in system. It is to be expected that on current top of the line servers percpu allocator memory usage will be 2-5x more than on 24 CPUs setup that was used for testing. Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory Patches applied on top of net-next branch: commit 2203cbf2 (net-next) Author: Russell King <rmk+kernel@armlinux.org.uk> Date: Tue Oct 15 11:38:39 2019 +0100 net: sfp: move fwnode parsing into sfp-bus layer Changes V1 -> V2: - Include memory measurements. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Control-action-percpu-counters-allocation-by-netlink-flag'
Vlad Buslov says: ==================== Control action percpu counters allocation by netlink flag Currently, significant fraction of CPU time during TC filter allocation is spent in percpu allocator. Moreover, percpu allocator is protected with single global mutex which negates any potential to improve its performance by means of recent developments in TC filter update API that removed rtnl lock for some Qdiscs and classifiers. In order to significantly improve filter update rate and reduce memory usage we would like to allow users to skip percpu counters allocation for specific action if they don't expect high traffic rate hitting the action, which is a reasonable expectation for hardware-offloaded setup. In that case any potential gains to software fast-path performance gained by usage of percpu-allocated counters compared to regular integer counters protected by spinlock are not important, but amount of additional CPU and memory consumed by them is significant. In order to allow configuring action counters allocation type at runtime, implement following changes: - Implement helper functions to update the action counters and use them in affected actions instead of updating counters directly. This steps abstracts actions implementation from counter types that are being used for particular action instance at runtime. - Modify the new helpers to use percpu counters if they were allocated during action initialization and use regular counters otherwise. - Extend action UAPI TCA_ACT space with TCA_ACT_FLAGS field. Add TCA_ACT_FLAGS_NO_PERCPU_STATS action flag and update hardware-offloaded actions to not allocate percpu counters when the flag is set. With this changes users that prefer action update slow-path speed over software fast-path speed can dynamically request actions to skip percpu counters allocation without affecting other users. Now, lets look at actual performance gains provided by this change. Simple test is used to measure insertion rate - iproute2 TC is executed in parallel by xargs in batch mode, its total execution time is measured by shell builtin "time" command. The command runs 20 concurrent tc instances, each with its own batch file with 100k rules: $ time ls add* | xargs -n 1 -P 20 sudo tc -b Two main rule profiles are tested. First is simple L2 flower classifier with single gact drop action. The configuration is chosen as worst case scenario because with single-action rules pressure on percpu allocator is minimized. Example rule: filter add dev ens1f0 protocol ip ingress prio 1 handle 1 flower skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 action drop Second profile is typical real-world scenario that uses flower classifier with some L2-4 fields and two actions (tunnel_key+mirred). Example rule: filter add dev ens1f0_0 protocol ip ingress prio 1 handle 1 flower skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 src_ip 192.168.111.1 dst_ip 192.168.111.2 ip_proto udp dst_port 1 src_port 1 action tunnel_key set id 1 src_ip 2.2.2.2 dst_ip 2.2.2.3 dst_port 4789 action mirred egress redirect dev vxlan1 Profile | percpu | no_percpu | X improvement | (k rules/sec) | (k rules/sec) | -------------------+---------------+---------------+--------------- Gact drop | 203 | 259 | 1.28 tunnel_key+mirred | 92 | 204 | 2.22 For simple drop action removing percpu allocation leads to ~25% insertion rate improvement. Perf profiles highlights the bottlenecks. Perf profile of run with percpu allocation (gact drop): + 89.11% 0.48% tc [kernel.vmlinux] [k] entry_SYSCALL_64 + 88.58% 0.04% tc [kernel.vmlinux] [k] do_syscall_64 + 87.50% 0.04% tc libc-2.29.so [.] __libc_sendmsg + 86.96% 0.04% tc [kernel.vmlinux] [k] __sys_sendmsg + 86.85% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg + 86.60% 0.05% tc [kernel.vmlinux] [k] sock_sendmsg + 86.55% 0.12% tc [kernel.vmlinux] [k] netlink_sendmsg + 86.04% 0.13% tc [kernel.vmlinux] [k] netlink_unicast + 85.42% 0.03% tc [kernel.vmlinux] [k] netlink_rcv_skb + 84.68% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg + 84.56% 0.24% tc [kernel.vmlinux] [k] tc_new_tfilter + 75.73% 0.65% tc [cls_flower] [k] fl_change + 71.30% 0.03% tc [kernel.vmlinux] [k] tcf_exts_validate + 71.27% 0.13% tc [kernel.vmlinux] [k] tcf_action_init + 71.06% 0.01% tc [kernel.vmlinux] [k] tcf_action_init_1 + 70.41% 0.04% tc [act_gact] [k] tcf_gact_init + 53.59% 1.21% tc [kernel.vmlinux] [k] __mutex_lock.isra.0 + 52.34% 0.34% tc [kernel.vmlinux] [k] tcf_idr_create - 51.23% 2.17% tc [kernel.vmlinux] [k] pcpu_alloc - 49.05% pcpu_alloc + 39.35% __mutex_lock.isra.0 4.99% memset_erms + 2.16% pcpu_alloc_area + 2.17% __libc_sendmsg + 45.89% 44.33% tc [kernel.vmlinux] [k] osq_lock + 9.94% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc + 7.76% 0.00% tc [kernel.vmlinux] [k] tcf_idr_insert + 6.50% 0.03% tc [kernel.vmlinux] [k] tfilter_notify + 6.24% 6.11% tc [kernel.vmlinux] [k] mutex_spin_on_owner + 5.73% 5.32% tc [kernel.vmlinux] [k] memset_erms + 5.31% 0.18% tc [kernel.vmlinux] [k] tcf_fill_node Here bottleneck is clearly in pcpu_alloc() function that takes more than half CPU time, which is mostly wasted busy-waiting for internal percpu allocator global lock. With percpu allocation removed (gact drop): + 87.50% 0.51% tc [kernel.vmlinux] [k] entry_SYSCALL_64 + 86.94% 0.07% tc [kernel.vmlinux] [k] do_syscall_64 + 85.75% 0.04% tc libc-2.29.so [.] __libc_sendmsg + 85.00% 0.07% tc [kernel.vmlinux] [k] __sys_sendmsg + 84.84% 0.07% tc [kernel.vmlinux] [k] ___sys_sendmsg + 84.59% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg + 84.58% 0.14% tc [kernel.vmlinux] [k] netlink_sendmsg + 83.95% 0.12% tc [kernel.vmlinux] [k] netlink_unicast + 83.34% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb + 82.39% 0.12% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg + 82.16% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter + 75.13% 0.84% tc [cls_flower] [k] fl_change + 69.92% 0.05% tc [kernel.vmlinux] [k] tcf_exts_validate + 69.87% 0.11% tc [kernel.vmlinux] [k] tcf_action_init + 69.61% 0.02% tc [kernel.vmlinux] [k] tcf_action_init_1 - 68.80% 0.10% tc [act_gact] [k] tcf_gact_init - 68.70% tcf_gact_init + 36.08% tcf_idr_check_alloc + 31.88% tcf_idr_insert + 63.72% 0.58% tc [kernel.vmlinux] [k] __mutex_lock.isra.0 + 58.80% 56.68% tc [kernel.vmlinux] [k] osq_lock + 36.08% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc + 31.88% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert The gact actions (like all other actions types) are inserted in single idr instance protected by global (per namespace) lock that becomes new bottleneck with such simple rule profile and prevents achieving 2x+ performance increase that can be expected by looking at profiling data for insertion action with percpu counter. Perf profile of run with percpu allocation (tunnel_key+mirred): + 91.95% 0.21% tc [kernel.vmlinux] [k] entry_SYSCALL_64 + 91.74% 0.06% tc [kernel.vmlinux] [k] do_syscall_64 + 90.74% 0.01% tc libc-2.29.so [.] __libc_sendmsg + 90.52% 0.01% tc [kernel.vmlinux] [k] __sys_sendmsg + 90.50% 0.04% tc [kernel.vmlinux] [k] ___sys_sendmsg + 90.41% 0.02% tc [kernel.vmlinux] [k] sock_sendmsg + 90.38% 0.04% tc [kernel.vmlinux] [k] netlink_sendmsg + 90.10% 0.06% tc [kernel.vmlinux] [k] netlink_unicast + 89.76% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb + 89.28% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg + 89.15% 0.03% tc [kernel.vmlinux] [k] tc_new_tfilter + 83.41% 0.33% tc [cls_flower] [k] fl_change + 81.17% 0.04% tc [kernel.vmlinux] [k] tcf_exts_validate + 81.13% 0.06% tc [kernel.vmlinux] [k] tcf_action_init + 81.04% 0.04% tc [kernel.vmlinux] [k] tcf_action_init_1 - 73.59% 2.16% tc [kernel.vmlinux] [k] pcpu_alloc - 71.42% pcpu_alloc + 61.41% __mutex_lock.isra.0 5.02% memset_erms + 2.93% pcpu_alloc_area + 2.16% __libc_sendmsg + 63.58% 0.17% tc [kernel.vmlinux] [k] tcf_idr_create + 63.40% 0.60% tc [kernel.vmlinux] [k] __mutex_lock.isra.0 + 57.85% 56.38% tc [kernel.vmlinux] [k] osq_lock + 46.27% 0.13% tc [act_tunnel_key] [k] tunnel_key_init + 34.26% 0.02% tc [act_mirred] [k] tcf_mirred_init + 10.99% 0.00% tc [kernel.vmlinux] [k] dst_cache_init + 5.32% 5.11% tc [kernel.vmlinux] [k] memset_erms With two times more actions pressure on percpu allocator doubles, so now it takes ~74% of CPU execution time. With percpu allocation removed (tunnel_key+mirred): + 86.02% 0.50% tc [kernel.vmlinux] [k] entry_SYSCALL_64 + 85.51% 0.12% tc [kernel.vmlinux] [k] do_syscall_64 + 84.40% 0.03% tc libc-2.29.so [.] __libc_sendmsg + 83.84% 0.03% tc [kernel.vmlinux] [k] __sys_sendmsg + 83.72% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg + 83.56% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg + 83.50% 0.08% tc [kernel.vmlinux] [k] netlink_sendmsg + 83.02% 0.17% tc [kernel.vmlinux] [k] netlink_unicast + 82.48% 0.00% tc [kernel.vmlinux] [k] netlink_rcv_skb + 81.89% 0.11% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg + 81.71% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter + 73.99% 0.63% tc [cls_flower] [k] fl_change + 69.72% 0.00% tc [kernel.vmlinux] [k] tcf_exts_validate + 69.72% 0.09% tc [kernel.vmlinux] [k] tcf_action_init + 69.53% 0.05% tc [kernel.vmlinux] [k] tcf_action_init_1 + 53.08% 0.91% tc [kernel.vmlinux] [k] __mutex_lock.isra.0 + 45.52% 43.99% tc [kernel.vmlinux] [k] osq_lock - 36.02% 0.21% tc [act_tunnel_key] [k] tunnel_key_init - 35.81% tunnel_key_init + 15.95% tcf_idr_check_alloc + 13.91% tcf_idr_insert - 4.70% dst_cache_init + 4.68% pcpu_alloc + 33.22% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc + 32.34% 0.05% tc [act_mirred] [k] tcf_mirred_init + 28.24% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert + 7.79% 0.05% tc [kernel.vmlinux] [k] idr_alloc_u32 + 7.67% 7.35% tc [kernel.vmlinux] [k] idr_get_free + 6.46% 6.22% tc [kernel.vmlinux] [k] mutex_spin_on_owner + 5.11% 0.05% tc [kernel.vmlinux] [k] tfilter_notify With percpu allocation removed insertion rate is increased by ~120%. Such rule profile scales much better than simple single action because both types of actions were competing for single lock in percpu allocator, but not for action idr lock, which is per-action. Note that percpu allocator is still used by dst_cache in tunnel_key actions and consumes 4.68% CPU time. Dst_cache seems like good opportunity for further insertion rate optimization but is not addressed by this change. Another improvement provided by this change is significantly reduced memory usage. The test is implemented by sampling "used memory" value from "vmstat -s" command output. Following table includes memory usage measurements for same two configurations that were used for measuring insertion rate: Profile | Mem per rule | Mem per rule no_percpu | Less memory used | (KB) | (KB) | (KB) -------------------+--------------+------------------------+------------------ Gact drop | 3.91 | 2.51 | 1.4 tunnel_key+mirred | 6.73 | 3.91 | 2.8 Results indicate that memory usage of percpu allocator per action is ~1.4 KB. Note that any measurements of percpu allocator memory usage is inherently tied to particular setup since memory usage is linear to number of cores in system. It is to be expected that on current top of the line servers percpu allocator memory usage will be 2-5x more than on 24 CPUs setup that was used for testing. Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory Patches applied on top of net-next branch: commit 2203cbf2 (net-next) Author: Russell King <rmk+kernel@armlinux.org.uk> Date: Tue Oct 15 11:38:39 2019 +0100 net: sfp: move fwnode parsing into sfp-bus layer Changes V1 -> V2: - Include memory measurements. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
d86784fe · David S. Miller · 21d8bd12 · 9ae6b787 · d86784fe · d86784fe
Commit d86784fe authored Oct 30, 2019 by David S. Miller
29 changed files
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -41,6 +41,7 @@ struct tc_action {
 	struct gnet_stats_queue __percpu *cpu_qstats;
 	struct tc_cookie	__rcu *act_cookie;
 	struct tcf_chain	__rcu *goto_chain;
+	u32			tcfa_flags;
 };
 #define tcf_index	common.tcfa_index
 #define tcf_refcnt	common.tcfa_refcnt
@@ -94,7 +95,7 @@ struct tc_action_ops {
 	int     (*init)(struct net *net, struct nlattr *nla,
 			struct nlattr *est, struct tc_action **act, int ovr,
 			int bind, bool rtnl_held, struct tcf_proto *tp,
-			struct netlink_ext_ack *extack);
+			u32 flags, struct netlink_ext_ack *extack);
 	int     (*walk)(struct net *, struct sk_buff *,
 			struct netlink_callback *, int,
 			const struct tc_action_ops *,
@@ -154,7 +155,11 @@ int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb,
 int tcf_idr_search(struct tc_action_net *tn, struct tc_action **a, u32 index);
 int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 		   struct tc_action **a, const struct tc_action_ops *ops,
-		   int bind, bool cpustats);
+		   int bind, bool cpustats, u32 flags);
+int tcf_idr_create_from_flags(struct tc_action_net *tn, u32 index,
+			      struct nlattr *est, struct tc_action **a,
+			      const struct tc_action_ops *ops, int bind,
+			      u32 flags);
 void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a);

 void tcf_idr_cleanup(struct tc_action_net *tn, u32 index);
@@ -186,6 +191,43 @@ int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[], int bind,
 		    int ref);
 int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int);
 int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int, int);
+
+static inline void tcf_action_update_bstats(struct tc_action *a,
+					    struct sk_buff *skb)
+{
+	if (likely(a->cpu_bstats)) {
+		bstats_cpu_update(this_cpu_ptr(a->cpu_bstats), skb);
+		return;
+	}
+	spin_lock(&a->tcfa_lock);
+	bstats_update(&a->tcfa_bstats, skb);
+	spin_unlock(&a->tcfa_lock);
+}
+
+static inline void tcf_action_inc_drop_qstats(struct tc_action *a)
+{
+	if (likely(a->cpu_qstats)) {
+		qstats_drop_inc(this_cpu_ptr(a->cpu_qstats));
+		return;
+	}
+	spin_lock(&a->tcfa_lock);
+	qstats_drop_inc(&a->tcfa_qstats);
+	spin_unlock(&a->tcfa_lock);
+}
+
+static inline void tcf_action_inc_overlimit_qstats(struct tc_action *a)
+{
+	if (likely(a->cpu_qstats)) {
+		qstats_overlimit_inc(this_cpu_ptr(a->cpu_qstats));
+		return;
+	}
+	spin_lock(&a->tcfa_lock);
+	qstats_overlimit_inc(&a->tcfa_qstats);
+	spin_unlock(&a->tcfa_lock);
+}
+
+void tcf_action_update_stats(struct tc_action *a, u64 bytes, u32 packets,
+			     bool drop, bool hw);
 int tcf_action_copy_stats(struct sk_buff *, struct tc_action *, int);

 int tcf_action_check_ctrlact(int action, struct tcf_proto *tp,

--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -1286,17 +1286,9 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
 void mini_qdisc_pair_init(struct mini_Qdisc_pair *miniqp, struct Qdisc *qdisc,
 			  struct mini_Qdisc __rcu **p_miniq);

-static inline void skb_tc_reinsert(struct sk_buff *skb, struct tcf_result *res)
+static inline int skb_tc_reinsert(struct sk_buff *skb, struct tcf_result *res)
 {
-	struct gnet_stats_queue *stats = res->qstats;
-	int ret;
-
-	if (res->ingress)
-		ret = netif_receive_skb(skb);
-	else
-		ret = dev_queue_xmit(skb);
-	if (ret && stats)
-		qstats_overlimit_inc(res->qstats);
+	return res->ingress ? netif_receive_skb(skb) : dev_queue_xmit(skb);
 }

 #endif
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -16,9 +16,14 @@ enum {
 	TCA_ACT_STATS,
 	TCA_ACT_PAD,
 	TCA_ACT_COOKIE,
+	TCA_ACT_FLAGS,
 	__TCA_ACT_MAX
 };

+#define TCA_ACT_FLAGS_NO_PERCPU_STATS 1 /* Don't use percpu allocator for
+					 * actions stats.
+					 */
+
 #define TCA_ACT_MAX __TCA_ACT_MAX
 #define TCA_OLD_COMPAT (TCA_ACT_MAX+1)
 #define TCA_ACT_MAX_PRIO 32

--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -399,7 +399,7 @@ static int tcf_idr_delete_index(struct tcf_idrinfo *idrinfo, u32 index)

 int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 		   struct tc_action **a, const struct tc_action_ops *ops,
-		   int bind, bool cpustats)
+		   int bind, bool cpustats, u32 flags)
 {
 	struct tc_action *p = kzalloc(ops->size, GFP_KERNEL);
 	struct tcf_idrinfo *idrinfo = tn->idrinfo;
@@ -427,6 +427,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 	p->tcfa_tm.install = jiffies;
 	p->tcfa_tm.lastuse = jiffies;
 	p->tcfa_tm.firstuse = 0;
+	p->tcfa_flags = flags;
 	if (est) {
 		err = gen_new_estimator(&p->tcfa_bstats, p->cpu_bstats,
 					&p->tcfa_rate_est,
@@ -451,6 +452,17 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 }
 EXPORT_SYMBOL(tcf_idr_create);

+int tcf_idr_create_from_flags(struct tc_action_net *tn, u32 index,
+			      struct nlattr *est, struct tc_action **a,
+			      const struct tc_action_ops *ops, int bind,
+			      u32 flags)
+{
+	/* Set cpustats according to actions flags. */
+	return tcf_idr_create(tn, index, est, a, ops, bind,
+			      !(flags & TCA_ACT_FLAGS_NO_PERCPU_STATS), flags);
+}
+EXPORT_SYMBOL(tcf_idr_create_from_flags);
+
 void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a)
 {
 	struct tcf_idrinfo *idrinfo = tn->idrinfo;
@@ -773,6 +785,14 @@ tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind, int ref)
 	}
 	rcu_read_unlock();

+	if (a->tcfa_flags) {
+		struct nla_bitfield32 flags = { a->tcfa_flags,
+						a->tcfa_flags, };
+
+		if (nla_put(skb, TCA_ACT_FLAGS, sizeof(flags), &flags))
+			goto nla_put_failure;
+	}
+
 	nest = nla_nest_start_noflag(skb, TCA_OPTIONS);
 	if (nest == NULL)
 		goto nla_put_failure;
@@ -831,12 +851,15 @@ static struct tc_cookie *nla_memdup_cookie(struct nlattr **tb)
 	return c;
 }

+static const u32 tca_act_flags_allowed = TCA_ACT_FLAGS_NO_PERCPU_STATS;
 static const struct nla_policy tcf_action_policy[TCA_ACT_MAX + 1] = {
 	[TCA_ACT_KIND]		= { .type = NLA_STRING },
 	[TCA_ACT_INDEX]		= { .type = NLA_U32 },
 	[TCA_ACT_COOKIE]	= { .type = NLA_BINARY,
 				    .len = TC_COOKIE_MAX_SIZE },
 	[TCA_ACT_OPTIONS]	= { .type = NLA_NESTED },
+	[TCA_ACT_FLAGS]		= { .type = NLA_BITFIELD32,
+				    .validation_data = &tca_act_flags_allowed },
 };

 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
@@ -845,6 +868,7 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 				    bool rtnl_held,
 				    struct netlink_ext_ack *extack)
 {
+	struct nla_bitfield32 flags = { 0, 0 };
 	struct tc_action *a;
 	struct tc_action_ops *a_o;
 	struct tc_cookie *cookie = NULL;
@@ -876,6 +900,8 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 				goto err_out;
 			}
 		}
+		if (tb[TCA_ACT_FLAGS])
+			flags = nla_get_bitfield32(tb[TCA_ACT_FLAGS]);
 	} else {
 		if (strlcpy(act_name, name, IFNAMSIZ) >= IFNAMSIZ) {
 			NL_SET_ERR_MSG(extack, "TC action name too long");
@@ -914,10 +940,10 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 	/* backward compatibility for policer */
 	if (name == NULL)
 		err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, ovr, bind,
-				rtnl_held, tp, extack);
+				rtnl_held, tp, flags.value, extack);
 	else
 		err = a_o->init(net, nla, est, &a, ovr, bind, rtnl_held,
-				tp, extack);
+				tp, flags.value, extack);
 	if (err < 0)
 		goto err_mod;

@@ -989,6 +1015,29 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 	return err;
 }

+void tcf_action_update_stats(struct tc_action *a, u64 bytes, u32 packets,
+			     bool drop, bool hw)
+{
+	if (a->cpu_bstats) {
+		_bstats_cpu_update(this_cpu_ptr(a->cpu_bstats), bytes, packets);
+
+		if (drop)
+			this_cpu_ptr(a->cpu_qstats)->drops += packets;
+
+		if (hw)
+			_bstats_cpu_update(this_cpu_ptr(a->cpu_bstats_hw),
+					   bytes, packets);
+		return;
+	}
+
+	_bstats_update(&a->tcfa_bstats, bytes, packets);
+	if (drop)
+		a->tcfa_qstats.drops += packets;
+	if (hw)
+		_bstats_update(&a->tcfa_bstats_hw, bytes, packets);
+}
+EXPORT_SYMBOL(tcf_action_update_stats);
+
 int tcf_action_copy_stats(struct sk_buff *skb, struct tc_action *p,
 			  int compat_mode)
 {

--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -275,7 +275,8 @@ static void tcf_bpf_prog_fill_cfg(const struct tcf_bpf *prog,
 static int tcf_bpf_init(struct net *net, struct nlattr *nla,
 			struct nlattr *est, struct tc_action **act,
 			int replace, int bind, bool rtnl_held,
-			struct tcf_proto *tp, struct netlink_ext_ack *extack)
+			struct tcf_proto *tp, u32 flags,
+			struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, bpf_net_id);
 	struct nlattr *tb[TCA_ACT_BPF_MAX + 1];
@@ -303,7 +304,7 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla,
 	ret = tcf_idr_check_alloc(tn, &index, act, bind);
 	if (!ret) {
 		ret = tcf_idr_create(tn, index, est, act,
-				     &act_bpf_ops, bind, true);
+				     &act_bpf_ops, bind, true, 0);
 		if (ret < 0) {
 			tcf_idr_cleanup(tn, index);
 			return ret;

--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -94,7 +94,7 @@ static const struct nla_policy connmark_policy[TCA_CONNMARK_MAX + 1] = {
 static int tcf_connmark_init(struct net *net, struct nlattr *nla,
 			     struct nlattr *est, struct tc_action **a,
 			     int ovr, int bind, bool rtnl_held,
-			     struct tcf_proto *tp,
+			     struct tcf_proto *tp, u32 flags,
 			     struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, connmark_net_id);
@@ -121,7 +121,7 @@ static int tcf_connmark_init(struct net *net, struct nlattr *nla,
 	ret = tcf_idr_check_alloc(tn, &index, a, bind);
 	if (!ret) {
 		ret = tcf_idr_create(tn, index, est, a,
-				     &act_connmark_ops, bind, false);
+				     &act_connmark_ops, bind, false, 0);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;

--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -43,7 +43,7 @@ static struct tc_action_ops act_csum_ops;
 static int tcf_csum_init(struct net *net, struct nlattr *nla,
 			 struct nlattr *est, struct tc_action **a, int ovr,
 			 int bind, bool rtnl_held, struct tcf_proto *tp,
-			 struct netlink_ext_ack *extack)
+			 u32 flags, struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, csum_net_id);
 	struct tcf_csum_params *params_new;
@@ -68,8 +68,8 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla,
 	index = parm->index;
 	err = tcf_idr_check_alloc(tn, &index, a, bind);
 	if (!err) {
-		ret = tcf_idr_create(tn, index, est, a,
-				     &act_csum_ops, bind, true);
+		ret = tcf_idr_create_from_flags(tn, index, est, a,
+						&act_csum_ops, bind, flags);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;
@@ -580,7 +580,7 @@ static int tcf_csum_act(struct sk_buff *skb, const struct tc_action *a,
 	params = rcu_dereference_bh(p->params);

 	tcf_lastuse_update(&p->tcf_tm);
-	bstats_cpu_update(this_cpu_ptr(p->common.cpu_bstats), skb);
+	tcf_action_update_bstats(&p->common, skb);

 	action = READ_ONCE(p->tcf_action);
 	if (unlikely(action == TC_ACT_SHOT))
@@ -624,7 +624,7 @@ static int tcf_csum_act(struct sk_buff *skb, const struct tc_action *a,
 	return action;

 drop:
-	qstats_drop_inc(this_cpu_ptr(p->common.cpu_qstats));
+	tcf_action_inc_drop_qstats(&p->common);
 	action = TC_ACT_SHOT;
 	goto out;
 }

--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -465,11 +465,11 @@ static int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a,
 	skb_push_rcsum(skb, nh_ofs);

 out:
-	bstats_cpu_update(this_cpu_ptr(a->cpu_bstats), skb);
+	tcf_action_update_bstats(&c->common, skb);
 	return retval;

 drop:
-	qstats_drop_inc(this_cpu_ptr(a->cpu_qstats));
+	tcf_action_inc_drop_qstats(&c->common);
 	return TC_ACT_SHOT;
 }

@@ -656,7 +656,7 @@ static int tcf_ct_fill_params(struct net *net,
 static int tcf_ct_init(struct net *net, struct nlattr *nla,
 		       struct nlattr *est, struct tc_action **a,
 		       int replace, int bind, bool rtnl_held,
-		       struct tcf_proto *tp,
+		       struct tcf_proto *tp, u32 flags,
 		       struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, ct_net_id);
@@ -688,8 +688,8 @@ static int tcf_ct_init(struct net *net, struct nlattr *nla,
 		return err;

 	if (!err) {
-		err = tcf_idr_create(tn, index, est, a,
-				     &act_ct_ops, bind, true);
+		err = tcf_idr_create_from_flags(tn, index, est, a,
+						&act_ct_ops, bind, flags);
 		if (err) {
 			tcf_idr_cleanup(tn, index);
 			return err;
@@ -905,11 +905,7 @@ static void tcf_stats_update(struct tc_action *a, u64 bytes, u32 packets,
 {
 	struct tcf_ct *c = to_ct(a);

-	_bstats_cpu_update(this_cpu_ptr(a->cpu_bstats), bytes, packets);
-
-	if (hw)
-		_bstats_cpu_update(this_cpu_ptr(a->cpu_bstats_hw),
-				   bytes, packets);
+	tcf_action_update_stats(a, bytes, packets, false, hw);
 	c->tcf_tm.lastuse = max_t(u64, c->tcf_tm.lastuse, lastuse);
 }


--- a/net/sched/act_ctinfo.c
+++ b/net/sched/act_ctinfo.c
@@ -153,7 +153,7 @@ static const struct nla_policy ctinfo_policy[TCA_CTINFO_MAX + 1] = {
 static int tcf_ctinfo_init(struct net *net, struct nlattr *nla,
 			   struct nlattr *est, struct tc_action **a,
 			   int ovr, int bind, bool rtnl_held,
-			   struct tcf_proto *tp,
+			   struct tcf_proto *tp, u32 flags,
 			   struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, ctinfo_net_id);
@@ -210,7 +210,7 @@ static int tcf_ctinfo_init(struct net *net, struct nlattr *nla,
 	err = tcf_idr_check_alloc(tn, &index, a, bind);
 	if (!err) {
 		ret = tcf_idr_create(tn, index, est, a,
-				     &act_ctinfo_ops, bind, false);
+				     &act_ctinfo_ops, bind, false, 0);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;

--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -53,7 +53,8 @@ static const struct nla_policy gact_policy[TCA_GACT_MAX + 1] = {
 static int tcf_gact_init(struct net *net, struct nlattr *nla,
 			 struct nlattr *est, struct tc_action **a,
 			 int ovr, int bind, bool rtnl_held,
-			 struct tcf_proto *tp, struct netlink_ext_ack *extack)
+			 struct tcf_proto *tp, u32 flags,
+			 struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, gact_net_id);
 	struct nlattr *tb[TCA_GACT_MAX + 1];
@@ -98,8 +99,8 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla,

 	err = tcf_idr_check_alloc(tn, &index, a, bind);
 	if (!err) {
-		ret = tcf_idr_create(tn, index, est, a,
-				     &act_gact_ops, bind, true);
+		ret = tcf_idr_create_from_flags(tn, index, est, a,
+						&act_gact_ops, bind, flags);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;
@@ -161,9 +162,9 @@ static int tcf_gact_act(struct sk_buff *skb, const struct tc_action *a,
 		action = gact_rand[ptype](gact);
 	}
 #endif
-	bstats_cpu_update(this_cpu_ptr(gact->common.cpu_bstats), skb);
+	tcf_action_update_bstats(&gact->common, skb);
 	if (action == TC_ACT_SHOT)
-		qstats_drop_inc(this_cpu_ptr(gact->common.cpu_qstats));
+		tcf_action_inc_drop_qstats(&gact->common);

 	tcf_lastuse_update(&gact->tcf_tm);

@@ -177,15 +178,7 @@ static void tcf_gact_stats_update(struct tc_action *a, u64 bytes, u32 packets,
 	int action = READ_ONCE(gact->tcf_action);
 	struct tcf_t *tm = &gact->tcf_tm;

-	_bstats_cpu_update(this_cpu_ptr(gact->common.cpu_bstats), bytes,
-			   packets);
-	if (action == TC_ACT_SHOT)
-		this_cpu_ptr(gact->common.cpu_qstats)->drops += packets;
-
-	if (hw)
-		_bstats_cpu_update(this_cpu_ptr(gact->common.cpu_bstats_hw),
-				   bytes, packets);
-
+	tcf_action_update_stats(a, bytes, packets, action == TC_ACT_SHOT, hw);
 	tm->lastuse = max_t(u64, tm->lastuse, lastuse);
 }


--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -465,7 +465,8 @@ static int populate_metalist(struct tcf_ife_info *ife, struct nlattr **tb,
 static int tcf_ife_init(struct net *net, struct nlattr *nla,
 			struct nlattr *est, struct tc_action **a,
 			int ovr, int bind, bool rtnl_held,
-			struct tcf_proto *tp, struct netlink_ext_ack *extack)
+			struct tcf_proto *tp, u32 flags,
+			struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, ife_net_id);
 	struct nlattr *tb[TCA_IFE_MAX + 1];
@@ -522,7 +523,7 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,

 	if (!exists) {
 		ret = tcf_idr_create(tn, index, est, a, &act_ife_ops,
-				     bind, true);
+				     bind, true, 0);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			kfree(p);

--- a/net/sched/act_ipt.c
+++ b/net/sched/act_ipt.c
@@ -95,7 +95,7 @@ static const struct nla_policy ipt_policy[TCA_IPT_MAX + 1] = {
 static int __tcf_ipt_init(struct net *net, unsigned int id, struct nlattr *nla,
 			  struct nlattr *est, struct tc_action **a,
 			  const struct tc_action_ops *ops, int ovr, int bind,
-			  struct tcf_proto *tp)
+			  struct tcf_proto *tp, u32 flags)
 {
 	struct tc_action_net *tn = net_generic(net, id);
 	struct nlattr *tb[TCA_IPT_MAX + 1];
@@ -144,7 +144,7 @@ static int __tcf_ipt_init(struct net *net, unsigned int id, struct nlattr *nla,

 	if (!exists) {
 		ret = tcf_idr_create(tn, index, est, a, ops, bind,
-				     false);
+				     false, 0);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;
@@ -205,19 +205,19 @@ static int __tcf_ipt_init(struct net *net, unsigned int id, struct nlattr *nla,
 static int tcf_ipt_init(struct net *net, struct nlattr *nla,
 			struct nlattr *est, struct tc_action **a, int ovr,
 			int bind, bool rtnl_held, struct tcf_proto *tp,
-			struct netlink_ext_ack *extack)
+			u32 flags, struct netlink_ext_ack *extack)
 {
 	return __tcf_ipt_init(net, ipt_net_id, nla, est, a, &act_ipt_ops, ovr,
-			      bind, tp);
+			      bind, tp, flags);
 }

 static int tcf_xt_init(struct net *net, struct nlattr *nla,
 		       struct nlattr *est, struct tc_action **a, int ovr,
 		       int bind, bool unlocked, struct tcf_proto *tp,
-		       struct netlink_ext_ack *extack)
+		       u32 flags, struct netlink_ext_ack *extack)
 {
 	return __tcf_ipt_init(net, xt_net_id, nla, est, a, &act_xt_ops, ovr,
-			      bind, tp);
+			      bind, tp, flags);
 }

 static int tcf_ipt_act(struct sk_buff *skb, const struct tc_action *a,

--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -93,7 +93,7 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 			   struct nlattr *est, struct tc_action **a,
 			   int ovr, int bind, bool rtnl_held,
 			   struct tcf_proto *tp,
-			   struct netlink_ext_ack *extack)
+			   u32 flags, struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, mirred_net_id);
 	struct nlattr *tb[TCA_MIRRED_MAX + 1];
@@ -148,8 +148,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 			NL_SET_ERR_MSG_MOD(extack, "Specified device does not exist");
 			return -EINVAL;
 		}
-		ret = tcf_idr_create(tn, index, est, a,
-				     &act_mirred_ops, bind, true);
+		ret = tcf_idr_create_from_flags(tn, index, est, a,
+						&act_mirred_ops, bind, flags);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;
@@ -231,7 +231,7 @@ static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,
 	}

 	tcf_lastuse_update(&m->tcf_tm);
-	bstats_cpu_update(this_cpu_ptr(m->common.cpu_bstats), skb);
+	tcf_action_update_bstats(&m->common, skb);

 	m_mac_header_xmit = READ_ONCE(m->tcfm_mac_header_xmit);
 	m_eaction = READ_ONCE(m->tcfm_eaction);
@@ -289,8 +289,8 @@ static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,
 		/* let's the caller reinsert the packet, if possible */
 		if (use_reinsert) {
 			res->ingress = want_ingress;
-			res->qstats = this_cpu_ptr(m->common.cpu_qstats);
-			skb_tc_reinsert(skb, res);
+			if (skb_tc_reinsert(skb, res))
+				tcf_action_inc_overlimit_qstats(&m->common);
 			__this_cpu_dec(mirred_rec_level);
 			return TC_ACT_CONSUMED;
 		}
@@ -303,7 +303,7 @@ static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,

 	if (err) {
 out:
-		qstats_overlimit_inc(this_cpu_ptr(m->common.cpu_qstats));
+		tcf_action_inc_overlimit_qstats(&m->common);
 		if (tcf_mirred_is_act_redirect(m_eaction))
 			retval = TC_ACT_SHOT;
 	}
@@ -318,10 +318,7 @@ static void tcf_stats_update(struct tc_action *a, u64 bytes, u32 packets,
 	struct tcf_mirred *m = to_mirred(a);
 	struct tcf_t *tm = &m->tcf_tm;

-	_bstats_cpu_update(this_cpu_ptr(a->cpu_bstats), bytes, packets);
-	if (hw)
-		_bstats_cpu_update(this_cpu_ptr(a->cpu_bstats_hw),
-				   bytes, packets);
+	tcf_action_update_stats(a, bytes, packets, false, hw);
 	tm->lastuse = max_t(u64, tm->lastuse, lastuse);
 }


--- a/net/sched/act_mpls.c
+++ b/net/sched/act_mpls.c
@@ -131,7 +131,8 @@ static const struct nla_policy mpls_policy[TCA_MPLS_MAX + 1] = {
 static int tcf_mpls_init(struct net *net, struct nlattr *nla,
 			 struct nlattr *est, struct tc_action **a,
 			 int ovr, int bind, bool rtnl_held,
-			 struct tcf_proto *tp, struct netlink_ext_ack *extack)
+			 struct tcf_proto *tp, u32 flags,
+			 struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, mpls_net_id);
 	struct nlattr *tb[TCA_MPLS_MAX + 1];
@@ -224,7 +225,7 @@ static int tcf_mpls_init(struct net *net, struct nlattr *nla,

 	if (!exists) {
 		ret = tcf_idr_create(tn, index, est, a,
-				     &act_mpls_ops, bind, true);
+				     &act_mpls_ops, bind, true, 0);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;

--- a/net/sched/act_nat.c
+++ b/net/sched/act_nat.c
@@ -36,7 +36,7 @@ static const struct nla_policy nat_policy[TCA_NAT_MAX + 1] = {
 static int tcf_nat_init(struct net *net, struct nlattr *nla, struct nlattr *est,
 			struct tc_action **a, int ovr, int bind,
 			bool rtnl_held,	struct tcf_proto *tp,
-			struct netlink_ext_ack *extack)
+			u32 flags, struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, nat_net_id);
 	struct nlattr *tb[TCA_NAT_MAX + 1];
@@ -61,7 +61,7 @@ static int tcf_nat_init(struct net *net, struct nlattr *nla, struct nlattr *est,
 	err = tcf_idr_check_alloc(tn, &index, a, bind);
 	if (!err) {
 		ret = tcf_idr_create(tn, index, est, a,
-				     &act_nat_ops, bind, false);
+				     &act_nat_ops, bind, false, 0);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;

--- a/net/sched/act_pedit.c
+++ b/net/sched/act_pedit.c
@@ -137,7 +137,8 @@ static int tcf_pedit_key_ex_dump(struct sk_buff *skb,
 static int tcf_pedit_init(struct net *net, struct nlattr *nla,
 			  struct nlattr *est, struct tc_action **a,
 			  int ovr, int bind, bool rtnl_held,
-			  struct tcf_proto *tp, struct netlink_ext_ack *extack)
+			  struct tcf_proto *tp, u32 flags,
+			  struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, pedit_net_id);
 	struct nlattr *tb[TCA_PEDIT_MAX + 1];
@@ -190,7 +191,7 @@ static int tcf_pedit_init(struct net *net, struct nlattr *nla,
 			goto out_free;
 		}
 		ret = tcf_idr_create(tn, index, est, a,
-				     &act_pedit_ops, bind, false);
+				     &act_pedit_ops, bind, false, 0);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			goto out_free;

--- a/net/sched/act_police.c
+++ b/net/sched/act_police.c
@@ -47,7 +47,7 @@ static const struct nla_policy police_policy[TCA_POLICE_MAX + 1] = {
 static int tcf_police_init(struct net *net, struct nlattr *nla,
 			       struct nlattr *est, struct tc_action **a,
 			       int ovr, int bind, bool rtnl_held,
-			       struct tcf_proto *tp,
+			       struct tcf_proto *tp, u32 flags,
 			       struct netlink_ext_ack *extack)
 {
 	int ret = 0, tcfp_result = TC_ACT_OK, err, size;
@@ -87,7 +87,7 @@ static int tcf_police_init(struct net *net, struct nlattr *nla,

 	if (!exists) {
 		ret = tcf_idr_create(tn, index, NULL, a,
-				     &act_police_ops, bind, true);
+				     &act_police_ops, bind, true, 0);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;
@@ -294,10 +294,7 @@ static void tcf_police_stats_update(struct tc_action *a,
 	struct tcf_police *police = to_police(a);
 	struct tcf_t *tm = &police->tcf_tm;

-	_bstats_cpu_update(this_cpu_ptr(a->cpu_bstats), bytes, packets);
-	if (hw)
-		_bstats_cpu_update(this_cpu_ptr(a->cpu_bstats_hw),
-				   bytes, packets);
+	tcf_action_update_stats(a, bytes, packets, false, hw);
 	tm->lastuse = max_t(u64, tm->lastuse, lastuse);
 }


--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -36,7 +36,7 @@ static const struct nla_policy sample_policy[TCA_SAMPLE_MAX + 1] = {
 static int tcf_sample_init(struct net *net, struct nlattr *nla,
 			   struct nlattr *est, struct tc_action **a, int ovr,
 			   int bind, bool rtnl_held, struct tcf_proto *tp,
-			   struct netlink_ext_ack *extack)
+			   u32 flags, struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, sample_net_id);
 	struct nlattr *tb[TCA_SAMPLE_MAX + 1];
@@ -69,7 +69,7 @@ static int tcf_sample_init(struct net *net, struct nlattr *nla,

 	if (!exists) {
 		ret = tcf_idr_create(tn, index, est, a,
-				     &act_sample_ops, bind, true);
+				     &act_sample_ops, bind, true, 0);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;

--- a/net/sched/act_simple.c
+++ b/net/sched/act_simple.c
@@ -86,7 +86,8 @@ static const struct nla_policy simple_policy[TCA_DEF_MAX + 1] = {
 static int tcf_simp_init(struct net *net, struct nlattr *nla,
 			 struct nlattr *est, struct tc_action **a,
 			 int ovr, int bind, bool rtnl_held,
-			 struct tcf_proto *tp, struct netlink_ext_ack *extack)
+			 struct tcf_proto *tp, u32 flags,
+			 struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, simp_net_id);
 	struct nlattr *tb[TCA_DEF_MAX + 1];
@@ -127,7 +128,7 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla,

 	if (!exists) {
 		ret = tcf_idr_create(tn, index, est, a,
-				     &act_simp_ops, bind, false);
+				     &act_simp_ops, bind, false, 0);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;

--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -86,7 +86,7 @@ static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
 static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
 			    struct nlattr *est, struct tc_action **a,
 			    int ovr, int bind, bool rtnl_held,
-			    struct tcf_proto *tp,
+			    struct tcf_proto *tp, u32 act_flags,
 			    struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, skbedit_net_id);
@@ -165,7 +165,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,

 	if (!exists) {
 		ret = tcf_idr_create(tn, index, est, a,
-				     &act_skbedit_ops, bind, true);
+				     &act_skbedit_ops, bind, true, 0);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;

--- a/net/sched/act_skbmod.c
+++ b/net/sched/act_skbmod.c
@@ -79,7 +79,7 @@ static const struct nla_policy skbmod_policy[TCA_SKBMOD_MAX + 1] = {
 static int tcf_skbmod_init(struct net *net, struct nlattr *nla,
 			   struct nlattr *est, struct tc_action **a,
 			   int ovr, int bind, bool rtnl_held,
-			   struct tcf_proto *tp,
+			   struct tcf_proto *tp, u32 flags,
 			   struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, skbmod_net_id);
@@ -143,7 +143,7 @@ static int tcf_skbmod_init(struct net *net, struct nlattr *nla,

 	if (!exists) {
 		ret = tcf_idr_create(tn, index, est, a,
-				     &act_skbmod_ops, bind, true);
+				     &act_skbmod_ops, bind, true, 0);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;

--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -31,7 +31,7 @@ static int tunnel_key_act(struct sk_buff *skb, const struct tc_action *a,
 	params = rcu_dereference_bh(t->params);

 	tcf_lastuse_update(&t->tcf_tm);
-	bstats_cpu_update(this_cpu_ptr(t->common.cpu_bstats), skb);
+	tcf_action_update_bstats(&t->common, skb);
 	action = READ_ONCE(t->tcf_action);

 	switch (params->tcft_action) {
@@ -208,7 +208,7 @@ static void tunnel_key_release_params(struct tcf_tunnel_key_params *p)
 static int tunnel_key_init(struct net *net, struct nlattr *nla,
 			   struct nlattr *est, struct tc_action **a,
 			   int ovr, int bind, bool rtnl_held,
-			   struct tcf_proto *tp,
+			   struct tcf_proto *tp, u32 act_flags,
 			   struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, tunnel_key_net_id);
@@ -347,8 +347,9 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	}

 	if (!exists) {
-		ret = tcf_idr_create(tn, index, est, a,
-				     &act_tunnel_key_ops, bind, true);
+		ret = tcf_idr_create_from_flags(tn, index, est, a,
+						&act_tunnel_key_ops, bind,
+						act_flags);
 		if (ret) {
 			NL_SET_ERR_MSG(extack, "Cannot create TC IDR");
 			goto release_tun_meta;

--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -29,7 +29,7 @@ static int tcf_vlan_act(struct sk_buff *skb, const struct tc_action *a,
 	u16 tci;

 	tcf_lastuse_update(&v->tcf_tm);
-	bstats_cpu_update(this_cpu_ptr(v->common.cpu_bstats), skb);
+	tcf_action_update_bstats(&v->common, skb);

 	/* Ensure 'data' points at mac_header prior calling vlan manipulating
 	 * functions.
@@ -88,7 +88,7 @@ static int tcf_vlan_act(struct sk_buff *skb, const struct tc_action *a,
 	return action;

 drop:
-	qstats_drop_inc(this_cpu_ptr(v->common.cpu_qstats));
+	tcf_action_inc_drop_qstats(&v->common);
 	return TC_ACT_SHOT;
 }

@@ -102,7 +102,8 @@ static const struct nla_policy vlan_policy[TCA_VLAN_MAX + 1] = {
 static int tcf_vlan_init(struct net *net, struct nlattr *nla,
 			 struct nlattr *est, struct tc_action **a,
 			 int ovr, int bind, bool rtnl_held,
-			 struct tcf_proto *tp, struct netlink_ext_ack *extack)
+			 struct tcf_proto *tp, u32 flags,
+			 struct netlink_ext_ack *extack)
 {
 	struct tc_action_net *tn = net_generic(net, vlan_net_id);
 	struct nlattr *tb[TCA_VLAN_MAX + 1];
@@ -188,8 +189,8 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,
 	action = parm->v_action;

 	if (!exists) {
-		ret = tcf_idr_create(tn, index, est, a,
-				     &act_vlan_ops, bind, true);
+		ret = tcf_idr_create_from_flags(tn, index, est, a,
+						&act_vlan_ops, bind, flags);
 		if (ret) {
 			tcf_idr_cleanup(tn, index);
 			return ret;
@@ -307,10 +308,7 @@ static void tcf_vlan_stats_update(struct tc_action *a, u64 bytes, u32 packets,
 	struct tcf_vlan *v = to_vlan(a);
 	struct tcf_t *tm = &v->tcf_tm;

-	_bstats_cpu_update(this_cpu_ptr(a->cpu_bstats), bytes, packets);
-	if (hw)
-		_bstats_cpu_update(this_cpu_ptr(a->cpu_bstats_hw),
-				   bytes, packets);
+	tcf_action_update_stats(a, bytes, packets, false, hw);
 	tm->lastuse = max_t(u64, tm->lastuse, lastuse);
 }


--- a/tools/testing/selftests/tc-testing/tc-tests/actions/csum.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/csum.json
@@ -525,5 +525,29 @@
        "teardown": [
            "$TC actions flush action csum"
        ]
+    },
+    {
+        "id": "eaf0",
+        "name": "Add csum iph action with no_percpu flag",
+        "category": [
+            "actions",
+            "csum"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action csum",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action csum iph no_percpu",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action csum",
+        "matchPattern": "action order [0-9]*: csum \\(iph\\) action pass.*no_percpu",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action csum"
+        ]
    }
 ]
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/ct.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/ct.json
@@ -310,5 +310,29 @@
        "teardown": [
            "$TC actions flush action ct"
        ]
+    },
+    {
+        "id": "3991",
+        "name": "Add simple ct action with no_percpu flag",
+        "category": [
+            "actions",
+            "ct"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ct",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ct no_percpu",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action ct",
+        "matchPattern": "action order [0-9]*: ct zone 0 pipe.*no_percpu",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ct"
+        ]
    }
 ]
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/gact.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/gact.json
@@ -585,5 +585,29 @@
        "teardown": [
            "$TC actions flush action gact"
        ]
+    },
+    {
+        "id": "95ad",
+        "name": "Add gact pass action with no_percpu flag",
+        "category": [
+            "actions",
+            "gact"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action gact",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action pass no_percpu",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action gact",
+        "matchPattern": "action order [0-9]*: gact action pass.*no_percpu",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action gact"
+        ]
    }
 ]
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
@@ -553,5 +553,29 @@
        "matchPattern": "^[ \t]+index [0-9]+ ref",
        "matchCount": "0",
        "teardown": []
+    },
+    {
+        "id": "31e3",
+        "name": "Add mirred mirror to egress action with no_percpu flag",
+        "category": [
+            "actions",
+            "mirred"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action mirred",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action mirred egress mirror dev lo no_percpu",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action mirred",
+        "matchPattern": "action order [0-9]*: mirred \\(Egress Mirror to device lo\\).*no_percpu",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action mirred"
+        ]
    }
 ]
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/tunnel_key.json
@@ -909,5 +909,29 @@
        "teardown": [
            "$TC actions flush action tunnel_key"
        ]
+    },
+    {
+        "id": "0cd2",
+        "name": "Add tunnel_key set action with no_percpu flag",
+        "category": [
+            "actions",
+            "tunnel_key"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action tunnel_key",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action tunnel_key set src_ip 10.10.10.1 dst_ip 20.20.20.2 id 1 no_percpu",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action tunnel_key",
+        "matchPattern": "action order [0-9]+: tunnel_key.*set.*src_ip 10.10.10.1.*dst_ip 20.20.20.2.*key_id 1.*no_percpu",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action tunnel_key"
+        ]
    }
 ]
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/vlan.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/vlan.json
@@ -807,5 +807,29 @@
        "matchPattern": "^[ \t]+index [0-9]+ ref",
        "matchCount": "0",
        "teardown": []
+    },
+    {
+        "id": "1a3d",
+        "name": "Add vlan pop action with no_percpu flag",
+        "category": [
+            "actions",
+            "vlan"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action vlan",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action vlan pop no_percpu",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action vlan",
+        "matchPattern": "action order [0-9]+: vlan.*pop.*no_percpu",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action vlan"
+        ]
    }
 ]