- 26 Oct, 2021 20 commits
-
-
David S. Miller authored
Florian Westphal says: ==================== vrf: rework interaction with netfilter/conntrack V2: - fix 'plain integer as null pointer' warning - reword commit message in patch 2 to clarify loss of 'ct set untracked' This patch series aims to solve the to-be-reverted change 09e856d5 ("vrf: Reset skb conntrack connection on VRF rcv") in a different way. Rather than have skbs pass through conntrack and nat hooks twice, suppress conntrack invocation if the conntrack/nat hook is called from the vrf driver. First patch deals with 'incoming connection' case: 1. suppress NAT transformations 2. skip conntrack confirmation NAT and conntrack confirmation is done when ip/ipv6 stack calls the postrouting hook. Second patch deals with local packets: in vrf driver, mark the skbs as 'untracked', so conntrack output hook ignores them. This skips all nat hooks as well. Afterwards, remove the untracked state again so the second round will pick them up. One alternative to the chosen implementation would be to add a 'caller id' field to 'struct nf_hook_state' and then use that, these patches use the more straightforward check of VRF flag on the state->out device. The two patches apply to both net and net-next, i am targeting -next because I think that since snat did not work correctly for so long that we can take the longer route. If you disagree, apply to net at your discretion. The patches apply both with 09e856d5 reverted or still in-place, but only with the revert in place ingress conntrack settings (zone, notrack etc) start working again. I've already submitted selftests for vrf+nfqueue and conntrack+vrf. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
The VRF driver invokes netfilter for output+postrouting hooks so that users can create rules that check for 'oif $vrf' rather than lower device name. This is a problem when NAT rules are configured. To avoid any conntrack involvement in round 1, tag skbs as 'untracked' to prevent conntrack from picking them up. This gets cleared before the packet gets handed to the ip stack so conntrack will be active on the second iteration. One remaining issue is that a rule like output ... oif $vrfname notrack won't propagate to the second round because we can't tell 'notrack set via ruleset' and 'notrack set by vrf driver' apart. However, this isn't a regression: the 'notrack' removal happens instead of unconditional nf_reset_ct(). I'd also like to avoid leaking more vrf specific conditionals into the netfilter infra. For ingress, conntrack has already been done before the packet makes it to the vrf driver, with this patch egress does connection tracking with lower/physical device as well. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
The VRF driver invokes netfilter for output+postrouting hooks so that users can create rules that check for 'oif $vrf' rather than lower device name. Afterwards, ip stack calls those hooks again. This is a problem when conntrack is used with IP masquerading. masquerading has an internal check that re-validates the output interface to account for route changes. This check will trigger in the vrf case. If the -j MASQUERADE rule matched on the first iteration, then round 2 finds state->out->ifindex != nat->masq_index: the latter is the vrf index, but out->ifindex is the lower device. The packet gets dropped and the conntrack entry is invalidated. This change makes conntrack postrouting skip the nat hooks. Also skip confirmation. This allows the second round (postrouting invocation from ipv4/ipv6) to create nat bindings. This also prevents the second round from seeing packets that had their source address changed by the nat hook. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linuxDavid S. Miller authored
Saeed Mahameed says: ==================== mlx5-updates-2021-10-25 Misc updates for mlx5 driver: 1) Misc updates and cleanups: - Don't write directly to netdev->dev_addr, From Jakub Kicinski - Remove unnecessary checks for slow path flag in tc module - Fix unused function warning of mlx5i_flow_type_mask - Bridge, support replacing existing FDB entry 2) Sub Functions, Reduction in memory usage: - Reduce flow counters bulk query buffer size - Implement max_macs devlink parameter - Add devlink vendor params to control Event Queue sizes - Added SF life cycle trace points by Parav/ 3) From Aya, Firmware health buffer reporting improvements - Print health buffer by log level and more missing information - Periodic update of host time to firmware ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jon Maxwell authored
v1: Implement a more general statement as recommended by Eric Dumazet. The sequence number will be advanced, so this check will fix the FIN case and other cases. A customer reported sockets stuck in the CLOSING state. A Vmcore revealed that the write_queue was not empty as determined by tcp_write_queue_empty() but the sk_buff containing the FIN flag had been freed and the socket was zombied in that state. Corresponding pcaps show no FIN from the Linux kernel on the wire. Some instrumentation was added to the kernel and it was found that there is a timing window where tcp_sendmsg() can run after tcp_send_fin(). tcp_sendmsg() will hit an error, for example: 1269 ▹ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
↩ 1270 ▹ ▹ goto do_error;↩ tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent. If the other side sends a FIN packet the socket will transition to CLOSING and remain that way until the system is rebooted. Fix this by checking for the FIN flag in the sk_buff and don't free it if that is the case. Testing confirmed that fixed the issue. Fixes: fdfc5c85 ("tcp: remove empty skb from write queue in error cases") Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Reported-by: Monir Zouaoui <Monir.Zouaoui@mail.schwarz> Reported-by: Simon Stier <simon.stier@mail.schwarz> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> -
Jakub Kicinski authored
Jean Sacren says: ==================== Small fixes for true expression checks This series fixes checks of true !rc expression. ==================== Link: https://lore.kernel.org/r/cover.1634974124.git.sakiwit@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jean Sacren authored
Remove the check of !rc in (!rc && !resc_lock_params.b_granted) since it is always true. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jean Sacren authored
Remove the check of !rc in (!rc && !params.b_granted) since it is always true. We should also use constant 0 for return. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Eric Dumazet says: ==================== tcp: receive path optimizations This series aims to reduce cache line misses in RX path. I am still working on better cache locality in tcp_sock but this will wait few more weeks. ==================== Link: https://lore.kernel.org/r/20211025164825.259415-1-eric.dumazet@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
Two kfree_skb() calls must be replaced by consume_skb() for skbs that are not technically dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
RFC 5082 IP_MINTTL option is rarely used on hosts. Add a static key to remove from TCP fast path useless code, and potential cache line miss to fetch inet_sk(sk)->min_ttl Note that once ip4_min_ttl static key has been enabled, it stays enabled until next boot. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
No report yet from KCSAN, yet worth documenting the races. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
RFC 5082 IPV6_MINHOPCOUNT is rarely used on hosts. Add a static key to remove from TCP fast path useless code, and potential cache line miss to fetch tcp_inet6_sk(sk)->min_hopcount Note that once ip6_min_hopcount static key has been enabled, it stays enabled until next boot. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
No report yet from KCSAN, yet worth documenting the races. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
sk->sk_rx_queue_mapping can be modified locklessly, add a couple of READ_ONCE()/WRITE_ONCE() to document this fact. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
sk_rx_queue_mapping is located in a cache line that should be kept read mostly. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
sk_napi_id is located in a cache line that can be kept read mostly. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
Increase cache locality by moving rx_dst_coookie next to sk->sk_rx_dst This removes one or two cache line misses in IPv6 early demux (TCP/UDP) Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
Increase cache locality by moving rx_dst_ifindex next to sk->sk_rx_dst This is part of an effort to reduce cache line misses in TCP fast path. This removes one cache line miss in early demux. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Alexander Lobakin authored
rx_dropped, tx_dropped, rx_frame_errors and rx_crc_errors are being wrongly fetched from the target container rather than source percpu ones. No idea if that goes from the vendor driver or was brainoed during the refactoring, but fix it either way. Fixes: a97c69ba ("net: ax88796c: ASIX AX88796C SPI Ethernet Adapter Driver") Signed-off-by: Alexander Lobakin <alobakin@pm.me> Acked-by: Łukasz Stelmach <l.stelmach@samsung.com> Link: https://lore.kernel.org/r/20211023121148.113466-1-alobakin@pm.meSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 25 Oct, 2021 20 commits
-
-
Parav Pandit authored
Add SF device add and delete specific trace points. echo mlx5:mlx5_sf_dev_add >> /sys/kernel/debug/tracing/set_event echo mlx5:mlx5_sf_dev_del >> /sys/kernel/debug/tracing/set_event echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/debug/tracing/set_event Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Parav Pandit authored
Add support for trace events for SFs to improve debugging. This covers (a) port add and free trace points (b) device level trace points (c) SF hardware context add, free trace points. (d) SF function activate/deacticate and state trace points SF events examples: echo mlx5:mlx5_sf_add >> /sys/kernel/debug/tracing/set_event echo mlx5:mlx5_sf_free >> /sys/kernel/debug/tracing/set_event echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/debug/tracing/set_event echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/debug/tracing/set_event echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/debug/tracing/set_event echo mlx5:mlx5_sf_update_state >> /sys/kernel/debug/tracing/set_event echo mlx5:mlx5_sf_activate >> /sys/kernel/debug/tracing/set_event echo mlx5:mlx5_sf_deactivate >> /sys/kernel/debug/tracing/set_event Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Shay Drory authored
Currently, max_macs is taking 70Kbytes of memory per function. This size is not needed in all use cases, and is critical with large scale. Hence, allow user to configure the number of max_macs. For example, to reduce the number of max_macs to 1, execute:: $ devlink dev param set pci/0000:00:0b.0 name max_macs value 1 \ cmode driverinit $ devlink dev reload pci/0000:00:0b.0 Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Shay Drory authored
Event EQ is an EQ which received the notification of almost all the events generated by the NIC. Currently, each event EQ is taking 512KB of memory. This size is not needed in most use cases, and is critical with large scale. Hence, allow user to configure the size of the event EQ. For example to reduce event EQ size to 64, execute:: $ devlink resource set pci/0000:00:0b.0 path /event_eq_size/ size 64 $ devlink dev reload pci/0000:00:0b.0 Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Shay Drory authored
Currently, each I/O EQ is taking 128KB of memory. This size is not needed in all use cases, and is critical with large scale. Hence, allow user to configure the size of I/O EQs. For example, to reduce I/O EQ size to 64, execute: $ devlink resource set pci/0000:00:0b.0 path /io_eq_size/ size 64 $ devlink dev reload pci/0000:00:0b.0 Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Vlad Buslov authored
The SWITCHDEV_FDB_ADD_TO_DEVICE is used for both adding new and replacing existing entry. Implement support for replacing existing FDB entries in mlx5 offload code. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Vlad Buslov authored
Following two patterns in bridge code are used in multiple places where similar code is duplicated: - Lookup FDB entry from hashtable by address+vid pair. - Notify software bridge and then delete existing FDB entry. In order to improve code quality and prepare for following patch series that also uses described patterns, extract the codes to dedicated helper functions. This commit doesn't change functionality. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Aya Levin authored
Firmware logs its asserts also to non-volatile memory. In order to reduce drift between the NIC and the host, the driver sets the host epoch-time to the firmware every hour. Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Aya Levin authored
Add log macro which gets log level as a parameter. Use the severity read from the health buffer and the new log macro to log the health buffer with severity as log level. Prior to this patch, health buffer was printed in error log level regardless of its severity. Now the user may filter dmesg (--level) or change kernel log level to focus on different severity levels of firmware errors. Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Aya Levin authored
Enhance health buffer to include: - assert_var5: expose the 6'th assert variable. - time: error's time-stamp in seconds (epoch time). - rfr: Recovery Flow Requiered. When set, indicates that the error cannot be recovered without flow involving reset. - severity: error's severity value, ranging from emergency to debug. Expose them in the health buffer dump (dmesg and devlink fw reporter). Health buffer in dmesg: mlx5_core 0000:08:00.0: print_health_info:425:(pid 912): Health issue observed, firmware internal error, severity(3) ERROR: mlx5_core 0000:08:00.0: print_health_info:429:(pid 912): assert_var[0] 0x08040700 mlx5_core 0000:08:00.0: print_health_info:429:(pid 912): assert_var[1] 0x00000000 mlx5_core 0000:08:00.0: print_health_info:429:(pid 912): assert_var[2] 0x00000000 mlx5_core 0000:08:00.0: print_health_info:429:(pid 912): assert_var[3] 0x00000000 mlx5_core 0000:08:00.0: print_health_info:429:(pid 912): assert_var[4] 0x00000000 mlx5_core 0000:08:00.0: print_health_info:429:(pid 912): assert_var[5] 0x00000000 mlx5_core 0000:08:00.0: print_health_info:432:(pid 912): assert_exit_ptr 0x00aaf800 mlx5_core 0000:08:00.0: print_health_info:434:(pid 912): assert_callra 0x00aaf70c mlx5_core 0000:08:00.0: print_health_info:436:(pid 912): fw_ver 16.32.492 mlx5_core 0000:08:00.0: print_health_info:437:(pid 912): time 1634819758 mlx5_core 0000:08:00.0: print_health_info:438:(pid 912): hw_id 0x0000020d mlx5_core 0000:08:00.0: print_health_info:439:(pid 912): rfr 0 mlx5_core 0000:08:00.0: print_health_info:440:(pid 912): severity 3 (ERROR) mlx5_core 0000:08:00.0: print_health_info:441:(pid 912): irisc_index 9 mlx5_core 0000:08:00.0: print_health_info:442:(pid 912): synd 0x1: firmware internal error mlx5_core 0000:08:00.0: print_health_info:444:(pid 912): ext_synd 0x802b mlx5_core 0000:08:00.0: print_health_info:445:(pid 912): raw fw_ver 0x102001ec Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Avihai Horon authored
Currently, the flow counters bulk query buffer takes a little more than 512KB of memory, which is aligned to the next power of 2, to 1MB. The buffer size determines the maximum number of flow counters that can be queried at a time. Thus, having a bigger buffer can improve performance for users that need to query many flow counters. SFs don't use many flow counters and don't need a big buffer. Since this size is critical with large scale, reduce the size of the bulk query buffer for SFs. Signed-off-by: Avihai Horon <avihaih@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Shay Drory authored
The cited commit is causing unused-function warning[1] when CONFIG_MLX5_EN_RXNFC is not set. Fix this by moving the function into the ifdef, where it's only used [1] warning: ‘mlx5i_flow_type_mask’ defined but not used [-Wunused-function] Fixes: 9fbe1c25 ("net/mlx5i: Enable Rx steering for IPoIB via ethtool") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Paul Blakey authored
After previous changes, caller (mlx5e_tc_offload_fdb_rules()) already checks for the slow path flag, and if set won't call offload/unoffload sample. Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Jakub Kicinski authored
Use a local buffer and eth_hw_addr_set() Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
-
Jakub Kicinski authored
Jakub Kicinski says: ==================== bluetooth: don't write directly to netdev->dev_addr The usual conversions. ==================== Link: https://lore.kernel.org/r/20211022231834.2710245-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Commit 406f42fa ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it go through appropriate helpers. Reviewed-by: Marcel Holtmann <marcel@holtmann.org> Acked-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Commit 406f42fa ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it go through appropriate helpers. Convert bluetooth from memcpy(... ETH_ADDR) to eth_hw_addr_set(): @@ expression dev, np; @@ - memcpy(dev->dev_addr, np, ETH_ALEN) + eth_hw_addr_set(dev, np) Reviewed-by: Marcel Holtmann <marcel@holtmann.org> Acked-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
hw_addr is a uint AKA unsigned int. dev_addr_set() takes a u8 *. drivers/net/fddi/defza.c:1383:27: error: passing argument 2 of 'dev_addr_set' from incompatible pointer type [-Werror=incompatible-pointer-types] Reported-by: kernel test robot <lkp@intel.com> Fixes: 1e9258c3 ("fddi: defxx,defza: use dev_addr_set()") Acked-by: Maciej W. Rozycki <macro@orcam.me.uk> Link: https://lore.kernel.org/r/20211025160000.2803818-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Tianjia Zhang authored
AES_CCM_128 and CHACHA20_POLY1305 are already supported by tls, similar to setsockopt, getsockopt also needs to support these two algorithms. Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tianjia Zhang authored
tls already supports the SM4 GCM/CCM algorithms. It is also necessary to add support for these two algorithms in tls_crypto_context to avoid potential issues caused by forced type conversion. Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-