- 21 Sep, 2022 3 commits
-
-
Christophe JAILLET authored
When the SPDX-License-Identifier tag has been added, the corresponding license text has not been removed. Remove it now. Also, in xt_connmark.h, move the copyright text at the top of the file which is a much more common pattern. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Florian Westphal <fw@strlen.de>
-
Antoine Tenart authored
The previous commit changed the way the rescheduling delay is computed which has a side effect: the bias is now represented as much as the other entries in the rescheduling delay which makes the logic to kick in only with very large sets, as the initial interval is very large (INT_MAX). Revisit the GC initial bias to allow more frequent GC for smaller sets while still avoiding wakeups when a machine is mostly idle. We're moving from a large initial value to pretending we have 100 entries expiring at the upper bound. This way only a few entries having a small timeout won't impact much the rescheduling delay and non-idle machines will have enough entries to lower the delay when needed. This also improves readability as the initial bias is now linked to what is computed instead of being an arbitrary large value. Fixes: 2cfadb76 ("netfilter: conntrack: revisit gc autotuning") Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>
-
Antoine Tenart authored
Commit 2cfadb76 ("netfilter: conntrack: revisit gc autotuning") changed the eviction rescheduling to the use average expiry of scanned entries (within 1-60s) by doing: for (...) { expires = clamp(nf_ct_expires(tmp), ...); next_run += expires; next_run /= 2; } The issue is the above will make the average ('next_run' here) more dependent on the last expiration values than the firsts (for sets > 2). Depending on the expiration values used to compute the average, the result can be quite different than what's expected. To fix this we can do the following: for (...) { expires = clamp(nf_ct_expires(tmp), ...); next_run += (expires - next_run) / ++count; } Fixes: 2cfadb76 ("netfilter: conntrack: revisit gc autotuning") Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>
-
- 20 Sep, 2022 37 commits
-
-
Ruffalo Lavoisier authored
- Delete the repeated word 'to' in the comment. - Add the missing 'use' word within the sentence. - Correct spelling on 'malformation', 'needs'. Signed-off-by: Ruffalo Lavoisier <RuffaloLavoisier@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20220919053447.5702-1-RuffaloLavoisier@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Ren Zhijie authored
If CONFIG_DCB is not set, make ARCH=x86_64 CROSS_COMPILE=x86_64-linux-gnu-, will be failed, like this: drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c: In function ‘otx2_select_queue’: drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c:1886:19: error: unused variable ‘pf’ [-Werror=unused-variable] struct otx2_nic *pf = netdev_priv(netdev); ^~ cc1: all warnings being treated as errors To fix this build error, put the definition of *pf under the CONFIG_DCB. Fixes: 99c969a8 ("octeontx2-pf: Add egress PFC support") Signed-off-by: Ren Zhijie <renzhijie2@huawei.com> Link: https://lore.kernel.org/r/20220919025840.256411-1-renzhijie2@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Karol Kolacinski authored
E810 products can support low latency Tx timestamp register read. This requires usage of threaded IRQ instead of kthread to reduce the kthread start latency (spikes up to 20 ms). Add a check for the device capability and use the new method if supported. Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20220916201728.241510-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Fabio Porcedda says: ==================== Add a secondary AT port to the Telit FN990 In order to add a secondary AT port to the Telit FN990 first add "DUN2" to mhi_wwan_ctrl.c, after that add a seconday AT port to the Telit FN990 in pci_generic.c ==================== Link: https://lore.kernel.org/r/20220916144329.243368-1-fabio.porcedda@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Fabio Porcedda authored
Add a secondary AT port using one of OEM reserved channel. Signed-off-by: Fabio Porcedda <fabio.porcedda@gmail.com> Reviewed-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Fabio Porcedda authored
In order to have a secondary AT port add "DUN2". Signed-off-by: Fabio Porcedda <fabio.porcedda@gmail.com> Reviewed-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Zhengchao Shao says: ==================== refactor duplicate codes in the tc cls walk function The walk implementation of most tc cls modules is basically the same. That is, the values of count and skip are checked first. If count is greater than or equal to skip, the registered fn function is executed. Otherwise, increase the value of count. So the code can be refactored. Then use helper function to replace the code of each cls module in alphabetical order. The walk function is invoked during dump. Therefore, test cases related to the tdc filter need to be added. ==================== Link: https://lore.kernel.org/r/20220916020251.190097-1-shaozhengchao@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Zhengchao Shao authored
Test 0811: Add multiple basic filter with cmp ematch u8/link layer and default action and dump them Test 5129: List basic filters Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Zhengchao Shao authored
Test 8293: Add tcindex filter with default action Test 7281: Add tcindex filter with hash size and pass action Test b294: Add tcindex filter with mask shift and reclassify action Test 0532: Add tcindex filter with pass_on and continue actions Test d473: Add tcindex filter with pipe action Test 2940: Add tcindex filter with miltiple actions Test 1893: List tcindex filters Test 2041: Change tcindex filter with pass action Test 9203: Replace tcindex filter with pass action Test 7957: Delete tcindex filter with drop action Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Zhengchao Shao authored
Test 2141: Add rsvp filter with tcp proto and specific IP address Test 5267: Add rsvp filter with udp proto and specific IP address Test 2819: Add rsvp filter with src ip and src port Test c967: Add rsvp filter with tunnelid and continue action Test 5463: Add rsvp filter with tunnel and pipe action Test 2332: Add rsvp filter with miltiple actions Test 8879: Add rsvp filter with tunnel and skp flag Test 8261: List rsvp filters Test 8989: Delete rsvp filter Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Zhengchao Shao authored
Test e122: Add route filter with from and to tag Test 6573: Add route filter with fromif and to tag Test 1362: Add route filter with to flag and reclassify action Test 4720: Add route filter with from flag and continue actions Test 2812: Add route filter with form tag and pipe action Test 7994: Add route filter with miltiple actions Test 4312: List route filters Test 2634: Delete route filter with pipe action Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Zhengchao Shao authored
Test 5294: Add flow filter with map key and ops Test 3514: Add flow filter with map key or ops Test 7534: Add flow filter with map key xor ops Test 4524: Add flow filter with map key rshift ops Test 0230: Add flow filter with map key addend ops Test 2344: Add flow filter with src map key Test 9304: Add flow filter with proto map key Test 9038: Add flow filter with proto-src map key Test 2a03: Add flow filter with proto-dst map key Test a073: Add flow filter with iif map key Test 3b20: Add flow filter with priority map key Test 8945: Add flow filter with mark map key Test c034: Add flow filter with nfct map key Test 0205: Add flow filter with nfct-src map key Test 5315: Add flow filter with nfct-src map key Test 7849: Add flow filter with nfct-proto-src map key Test 9902: Add flow filter with nfct-proto-dst map key Test 6742: Add flow filter with rt-classid map key Test 5432: Add flow filter with sk-uid map key Test 4134: Add flow filter with sk-gid map key Test 4522: Add flow filter with vlan-tag map key Test 4253: Add flow filter with rxhash map key Test 4452: Add flow filter with hash key list Test 4341: Add flow filter with muliple ops Test 4392: List flow filters Test 4322: Change flow filter with map key num Test 2320: Replace flow filter with map key num Test 3213: Delete flow filter with map key num Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Zhengchao Shao authored
Test 6273: Add cgroup filter with cmp ematch u8/link layer and drop action Test 4721: Add cgroup filter with cmp ematch u8/link layer with trans flag and pass action Test d392: Add cgroup filter with cmp ematch u16/link layer and pipe action Test 0234: Add cgroup filter with cmp ematch u32/link layer and miltiple actions Test 8499: Add cgroup filter with cmp ematch u8/network layer and pass action Test b273: Add cgroup filter with cmp ematch u8/network layer with trans flag and drop action Test 1934: Add cgroup filter with cmp ematch u16/network layer and pipe action Test 2733: Add cgroup filter with cmp ematch u32/network layer and miltiple actions Test 3271: Add cgroup filter with NOT cmp ematch rule and pass action Test 2362: Add cgroup filter with two ANDed cmp ematch rules and single action Test 9993: Add cgroup filter with two ORed cmp ematch rules and single action Test 2331: Add cgroup filter with two ANDed cmp ematch rules and one ORed ematch rule and single action Test 3645: Add cgroup filter with two ANDed cmp ematch rules and one NOT ORed ematch rule and single action Test b124: Add cgroup filter with u32 ematch u8/zero offset and drop action Test 7381: Add cgroup filter with u32 ematch u8/zero offset and invalid value >0xFF Test 2231: Add cgroup filter with u32 ematch u8/positive offset and drop action Test 1882: Add cgroup filter with u32 ematch u8/invalid mask >0xFF Test 1237: Add cgroup filter with u32 ematch u8/missing offset Test 3812: Add cgroup filter with u32 ematch u8/missing AT keyword Test 1112: Add cgroup filter with u32 ematch u8/missing value Test 3241: Add cgroup filter with u32 ematch u8/non-numeric value Test e231: Add cgroup filter with u32 ematch u8/non-numeric mask Test 4652: Add cgroup filter with u32 ematch u8/negative offset and pass Test 1331: Add cgroup filter with u32 ematch u16/zero offset and pipe action Test e354: Add cgroup filter with u32 ematch u16/zero offset and invalid value >0xFFFF Test 3538: Add cgroup filter with u32 ematch u16/positive offset and drop action Test 4576: Add cgroup filter with u32 ematch u16/invalid mask >0xFFFF Test b842: Add cgroup filter with u32 ematch u16/missing offset Test c924: Add cgroup filter with u32 ematch u16/missing AT keyword Test cc93: Add cgroup filter with u32 ematch u16/missing value Test 123c: Add cgroup filter with u32 ematch u16/non-numeric value Test 3675: Add cgroup filter with u32 ematch u16/non-numeric mask Test 1123: Add cgroup filter with u32 ematch u16/negative offset and drop action Test 4234: Add cgroup filter with u32 ematch u16/nexthdr+ offset and pass action Test e912: Add cgroup filter with u32 ematch u32/zero offset and pipe action Test 1435: Add cgroup filter with u32 ematch u32/positive offset and drop action Test 1282: Add cgroup filter with u32 ematch u32/missing offset Test 6456: Add cgroup filter with u32 ematch u32/missing AT keyword Test 4231: Add cgroup filter with u32 ematch u32/missing value Test 2131: Add cgroup filter with u32 ematch u32/non-numeric value Test f125: Add cgroup filter with u32 ematch u32/non-numeric mask Test 4316: Add cgroup filter with u32 ematch u32/negative offset and drop action Test 23ae: Add cgroup filter with u32 ematch u32/nexthdr+ offset and pipe action Test 23a1: Add cgroup filter with canid ematch and single SFF Test 324f: Add cgroup filter with canid ematch and single SFF with mask Test 2576: Add cgroup filter with canid ematch and multiple SFF Test 4839: Add cgroup filter with canid ematch and multiple SFF with masks Test 6713: Add cgroup filter with canid ematch and single EFF Test ab9d: Add cgroup filter with canid ematch and multiple EFF with masks Test 5349: Add cgroup filter with canid ematch and a combination of SFF/EFF Test c934: Add cgroup filter with canid ematch and a combination of SFF/EFF with masks Test 4319: Replace cgroup filter with diffferent match Test 4636: Delete cgroup filter Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Zhengchao Shao authored
Test 23c3: Add cBPF filter with valid bytecode Test 1563: Add cBPF filter with invalid bytecode Test 2334: Add eBPF filter with valid object-file Test 2373: Add eBPF filter with invalid object-file Test 4423: Replace cBPF bytecode Test 5122: Delete cBPF filter Test e0a9: List cBPF filters Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Zhengchao Shao authored
use tc_cls_stats_dump() in filter. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Zhengchao Shao authored
The walk implementation of most tc cls modules is basically the same. That is, the values of count and skip are checked first. If count is greater than or equal to skip, the registered fn function is executed. Otherwise, increase the value of count. So we can reconstruct them. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Rafał Miłecki authored
Reading MAC from OF may return -EPROBE_DEFER if underlaying NVMEM device isn't ready yet. In such case pass that error code up and "wait" to be probed later. Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Link: https://lore.kernel.org/r/20220915133013.2243-1-zajec5@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Lukas Bulwahn authored
It makes little sense to ask if networking namespace or net device refcount tracking shall be enabled for debug kernel builds without network support. This is similar to the commit eb0b39ef ("net: CONFIG_DEBUG_NET depends on CONFIG_NET"). Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/r/20220915124256.32512-1-lukas.bulwahn@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Vladimir Oltean says: ==================== Small tc-taprio improvements This series contains: - the proper protected variant of rcu_dereference() of admin and oper schedules for accesses from the slow path - a removal of an extra function pointer indirection for qdisc->dequeue() and qdisc->peek() - a removal of WARN_ON_ONCE() checks that can never trigger - the addition of netlink extack messages to some qdisc->init() failures These were split from an earlier patch set, hence the v2. ==================== Link: https://lore.kernel.org/r/20220915105046.2404072-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
The WARN_ON_ONCE() checks introduced in commit 13511704 ("net: taprio offload: enforce qdisc to netdev queue mapping") take a small toll on performance, but otherwise, the conditions are never expected to happen. Replace them with comments, such that the information is still conveyed to developers. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Stop contributing to the proverbial user unfriendliness of tc, and tell the user what is wrong wherever possible. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Since commit 13511704 ("net: taprio offload: enforce qdisc to netdev queue mapping"), taprio_dequeue_soft() and taprio_peek_soft() are de facto the only implementations for Qdisc_ops :: dequeue and Qdisc_ops :: peek that taprio provides. This is because in full offload mode, __dev_queue_xmit() will select a txq->qdisc which is never root taprio qdisc. So if nothing is enqueued in the root qdisc, it will never be run and nothing will get dequeued from it. Therefore, we can remove the private indirection from taprio, and always point Qdisc_ops :: dequeue to taprio_dequeue_soft (now simply named taprio_dequeue) and Qdisc_ops :: peek to taprio_peek_soft (now simply named taprio_peek). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Since commit 13511704 ("net: taprio offload: enforce qdisc to netdev queue mapping"), __dev_queue_xmit() will select a txq->qdisc for the full offload case of taprio which isn't the root taprio qdisc, so qdisc enqueues will never pass through taprio_enqueue(). That commit already introduced one safety precaution check for FULL_OFFLOAD_IS_ENABLED(); a second one is really not needed, so simplify the conditional for entering into the GSO segmentation logic. Also reword the comment a little, to appear more natural after the code change. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Sparse complains that taprio_destroy() dereferences q->oper_sched and q->admin_sched without rcu_dereference(), since they are marked as __rcu in the taprio private structure. 1671:28: warning: incorrect type in argument 1 (different address spaces) 1671:28: expected struct callback_head *head 1671:28: got struct callback_head [noderef] __rcu * 1674:28: warning: incorrect type in argument 1 (different address spaces) 1674:28: expected struct callback_head *head 1674:28: got struct callback_head [noderef] __rcu * To silence that build warning, do actually use rtnl_dereference(), since we know the rtnl_mutex is held at the time of q->destroy(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Since the writer-side lock is taken here, we do not need to open an RCU read-side critical section, instead we can use rtnl_dereference() to tell lockdep we are serialized with concurrent writes. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
The locking in taprio_offload_config_changed() is wrong (but also inconsequentially so). The current_entry_lock does not serialize changes to the admin and oper schedules, only to the current entry. In fact, the rtnl_mutex does that, and that is taken at the time when taprio_change() is called. Replace the rcu_dereference_protected() method with the proper RCU annotation, and drop the unnecessary spin lock. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Simon Horman says: ==================== nfp: flower: police validation and ct enhancements this series enhances the flower hardware offload facility provided by the nfp driver. 1. Add validation of police actions created independently of flows 2. Add support offload of ct NAT action 3. Support offload of rule which has both vlan push/pop/mangle and ct action ==================== Link: https://lore.kernel.org/r/20220914160604.1740282-1-simon.horman@corigine.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Hui Zhou authored
Support hardware offload of rule which has both vlan push/pop/mangle and ct action. Signed-off-by: Hui Zhou <hui.zhou@corigine.com> Reviewed-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Hui Zhou authored
support ct nat action when pre_ct merge with post_ct and nft. at the same time, add the extra checksum action and hardware stats for nft to meet the action check when do nat. Signed-off-by: Hui Zhou <hui.zhou@corigine.com> Reviewed-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Ziyang Chen authored
Validation of police actions was added to offload drivers in commit d97b4b10 ("flow_offload: reject offload for all drivers with invalid police parameters") This patch extends that validation in the nfp driver to include police actions which are created independently of flows. Signed-off-by: Ziyang Chen <ziyang.chen@corigine.com> Reviewed-by: Baowen Zheng <baowen.zheng@corigine.com> Reviewed-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Kuniyuki Iwashima says: ==================== tcp: Introduce optional per-netns ehash. The more sockets we have in the hash table, the longer we spend looking up the socket. While running a number of small workloads on the same host, they penalise each other and cause performance degradation. The root cause might be a single workload that consumes much more resources than the others. It often happens on a cloud service where different workloads share the same computing resource. On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash entries), after running iperf3 in different netns, creating 24Mi sockets without data transfer in the root netns causes about 10% performance regression for the iperf3's connection. thash_entries sockets length Gbps 524288 1 1 50.7 24Mi 48 45.1 It is basically related to the length of the list of each hash bucket. For testing purposes to see how performance drops along the length, I set 131072 (1Mi / 8) to thash_entries, and here's the result. thash_entries sockets length Gbps 131072 1 1 50.7 1Mi 8 49.9 2Mi 16 48.9 4Mi 32 47.3 8Mi 64 44.6 16Mi 128 40.6 24Mi 192 36.3 32Mi 256 32.5 40Mi 320 27.0 48Mi 384 25.0 To resolve the socket lookup degradation, we introduce an optional per-netns hash table for TCP, but it's just ehash, and we still share the global bhash, bhash2 and lhash2. With a smaller ehash, we can look up non-listener sockets faster and isolate such noisy neighbours. Also, we can reduce lock contention. For details, please see the last patch. patch 1 - 4: prep for per-netns ehash patch 5: small optimisation for netns dismantle without TIME_WAIT sockets patch 6: add per-netns ehash Many thanks to Eric Dumazet for reviewing and advising. ==================== Link: https://lore.kernel.org/r/20220908011022.45342-1-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
The more sockets we have in the hash table, the longer we spend looking up the socket. While running a number of small workloads on the same host, they penalise each other and cause performance degradation. The root cause might be a single workload that consumes much more resources than the others. It often happens on a cloud service where different workloads share the same computing resource. On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash entries), after running iperf3 in different netns, creating 24Mi sockets without data transfer in the root netns causes about 10% performance regression for the iperf3's connection. thash_entries sockets length Gbps 524288 1 1 50.7 24Mi 48 45.1 It is basically related to the length of the list of each hash bucket. For testing purposes to see how performance drops along the length, I set 131072 (1Mi / 8) to thash_entries, and here's the result. thash_entries sockets length Gbps 131072 1 1 50.7 1Mi 8 49.9 2Mi 16 48.9 4Mi 32 47.3 8Mi 64 44.6 16Mi 128 40.6 24Mi 192 36.3 32Mi 256 32.5 40Mi 320 27.0 48Mi 384 25.0 To resolve the socket lookup degradation, we introduce an optional per-netns hash table for TCP, but it's just ehash, and we still share the global bhash, bhash2 and lhash2. With a smaller ehash, we can look up non-listener sockets faster and isolate such noisy neighbours. In addition, we can reduce lock contention. We can control the ehash size by a new sysctl knob. However, depending on workloads, it will require very sensitive tuning, so we disable the feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover, we can fall back to using the global ehash in case we fail to allocate enough memory for a new ehash. The maximum size is 16Mi, which is large enough that even if we have 48Mi sockets, the average list length is 3, and regression would be less than 1%. We can check the current ehash size by another read-only sysctl knob, net.ipv4.tcp_ehash_entries. A negative value means the netns shares the global ehash (per-netns ehash is disabled or failed to allocate memory). # dmesg | cut -d ' ' -f 5- | grep "established hash" TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage) # sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries # sysctl net.ipv4.tcp_child_ehash_entries net.ipv4.tcp_child_ehash_entries = 0 # disabled by default # ip netns add test1 # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = -524288 # share the global ehash # sysctl -w net.ipv4.tcp_child_ehash_entries=100 net.ipv4.tcp_child_ehash_entries = 100 # ip netns add test2 # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets When more than two processes in the same netns create per-netns ehash concurrently with different sizes, we need to guarantee the size in one of the following ways: 1) Share the global ehash and create per-netns ehash First, unshare() with tcp_child_ehash_entries==0. It creates dedicated netns sysctl knobs where we can safely change tcp_child_ehash_entries and clone()/unshare() to create a per-netns ehash. 2) Control write on sysctl by BPF We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on sysctl knobs. Note that the global ehash allocated at the boot time is spread over available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate pages for each per-netns ehash depending on the current process's NUMA policy. By default, the allocation is done in the local node only, so the per-netns hash table could fully reside on a random node. Thus, depending on the NUMA policy the netns is created with and the CPU the current thread is running on, we could see some performance differences for highly optimised networking applications. Note also that the default values of two sysctl knobs depend on the ehash size and should be tuned carefully: tcp_max_tw_buckets : tcp_child_ehash_entries / 2 tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128) As a bonus, we can dismantle netns faster. Currently, while destroying netns, we call inet_twsk_purge(), which walks through the global ehash. It can be potentially big because it can have many sockets other than TIME_WAIT in all netns. Splitting ehash changes that situation, where it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets in each netns. With regard to this, we do not free the per-netns ehash in inet_twsk_kill() to avoid UAF while iterating the per-netns ehash in inet_twsk_purge(). Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to keep it protocol-family-independent. In the future, we could optimise ehash lookup/iteration further by removing netns comparison for the per-netns ehash. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
While destroying netns, we call inet_twsk_purge() in tcp_sk_exit_batch() and tcpv6_net_exit_batch() for AF_INET and AF_INET6. These commands trigger the kernel to walk through the potentially big ehash twice even though the netns has no TIME_WAIT sockets. # ip netns add test # ip netns del test or # unshare -n /bin/true >/dev/null When tw_refcount is 1, we need not call inet_twsk_purge() at least for the net. We can save such unneeded iterations if all netns in net_exit_list have no TIME_WAIT sockets. This change eliminates the tax by the additional unshare() described in the next patch to guarantee the per-netns ehash size. Tested: # mount -t debugfs none /sys/kernel/debug/ # echo cleanup_net > /sys/kernel/debug/tracing/set_ftrace_filter # echo inet_twsk_purge >> /sys/kernel/debug/tracing/set_ftrace_filter # echo function > /sys/kernel/debug/tracing/current_tracer # cat ./add_del_unshare.sh for i in `seq 1 40` do (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) & done wait; # ./add_del_unshare.sh Before the patch: # cat /sys/kernel/debug/tracing/trace_pipe kworker/u128:0-8 [031] ...1. 174.162765: cleanup_net <-process_one_work kworker/u128:0-8 [031] ...1. 174.240796: inet_twsk_purge <-cleanup_net kworker/u128:0-8 [032] ...1. 174.244759: inet_twsk_purge <-tcp_sk_exit_batch kworker/u128:0-8 [034] ...1. 174.290861: cleanup_net <-process_one_work kworker/u128:0-8 [039] ...1. 175.245027: inet_twsk_purge <-cleanup_net kworker/u128:0-8 [046] ...1. 175.290541: inet_twsk_purge <-tcp_sk_exit_batch kworker/u128:0-8 [037] ...1. 175.321046: cleanup_net <-process_one_work kworker/u128:0-8 [024] ...1. 175.941633: inet_twsk_purge <-cleanup_net kworker/u128:0-8 [025] ...1. 176.242539: inet_twsk_purge <-tcp_sk_exit_batch After: # cat /sys/kernel/debug/tracing/trace_pipe kworker/u128:0-8 [038] ...1. 428.116174: cleanup_net <-process_one_work kworker/u128:0-8 [038] ...1. 428.262532: cleanup_net <-process_one_work kworker/u128:0-8 [030] ...1. 429.292645: cleanup_net <-process_one_work Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
We will soon introduce an optional per-netns ehash. This means we cannot use tcp_hashinfo directly in most places. Instead, access it via net->ipv4.tcp_death_row.hashinfo. The access will be valid only while initialising tcp_hashinfo itself and creating/destroying each netns. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
We will soon introduce an optional per-netns ehash. This means we cannot use the global sk->sk_prot->h.hashinfo to fetch a TCP hashinfo. Instead, set NULL to sk->sk_prot->h.hashinfo for TCP and get a proper hashinfo from net->ipv4.tcp_death_row.hashinfo. Note that we need not use sk->sk_prot->h.hashinfo if DCCP is disabled. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
We will soon introduce an optional per-netns ehash and access hash tables via net->ipv4.tcp_death_row->hashinfo instead of &tcp_hashinfo in most places. It could harm the fast path because dereferences of two fields in net and tcp_death_row might incur two extra cache line misses. To save one dereference, let's place tcp_death_row back in netns_ipv4 and fetch hashinfo via net->ipv4.tcp_death_row"."hashinfo. Note tcp_death_row was initially placed in netns_ipv4, and commit fbb82952 ("tcp: allocate tcp_death_row outside of struct netns_ipv4") changed it to a pointer so that we can fire TIME_WAIT timers after freeing net. However, we don't do so after commit 04c494e6 ("Revert "tcp/dccp: get rid of inet_twsk_purge()""), so we need not define tcp_death_row as a pointer. Also, we move refcount_dec_and_test(&tw_refcount) from tcp_sk_exit() to tcp_sk_exit_batch() as a debug check. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
This patch adds no functional change and cleans up some functions that the following patches touch around so that we make them tidy and easy to review/revert. The changes are - Keep reverse christmas tree order - Remove unnecessary init of port in inet_csk_find_open_port() - Use req_to_sk() once in reqsk_queue_unlink() - Use sock_net(sk) once in tcp_time_wait() and tcp_v[46]_connect() Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-