- 12 Apr, 2024 25 commits
-
-
Fei Qin authored
Newer NIC will introduce a new part number, now add it into devlink device info. This patch also updates the information of "board.id" in nfp.rst to match the devlink-info.rst. Signed-off-by: Fei Qin <fei.qin@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Fei Qin authored
Add definition and documentation for the new generic info "board.part_number". The new one is for part number specific use, and board.id is modified to match the documentation in devlink-info. Signed-off-by: Fei Qin <fei.qin@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hechao Li authored
After commit dfa2f048 ("tcp: get rid of sysctl_tcp_adv_win_scale"), we noticed an application-level timeout due to reduced throughput. Before the commit, for a client that sets SO_RCVBUF to 65k, it takes around 22 seconds to transfer 10M data. After the commit, it takes 40 seconds. Because our application has a 30-second timeout, this regression broke the application. The reason that it takes longer to transfer data is that tp->scaling_ratio is initialized to a value that results in ~0.25 of rcvbuf. In our case, SO_RCVBUF is set to 65536 by the application, which translates to 2 * 65536 = 131,072 bytes in rcvbuf and hence a ~28k initial receive window. Later, even though the scaling_ratio is updated to a more accurate skb->len/skb->truesize, which is ~0.66 in our environment, the window stays at ~0.25 * rcvbuf. This is because tp->window_clamp does not change together with the tp->scaling_ratio update when autotuning is disabled due to SO_RCVBUF. As a result, the window size is capped at the initial window_clamp, which is also ~0.25 * rcvbuf, and never grows bigger. Most modern applications let the kernel do autotuning, and benefit from the increased scaling_ratio. But there are applications such as kafka that has a default setting of SO_RCVBUF=64k. This patch increases the initial scaling_ratio from ~25% to 50% in order to make it backward compatible with the original default sysctl_tcp_adv_win_scale for applications setting SO_RCVBUF. Fixes: dfa2f048 ("tcp: get rid of sysctl_tcp_adv_win_scale") Signed-off-by: Hechao Li <hli@netflix.com> Reviewed-by: Tycho Andersen <tycho@tycho.pizza> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/20240402215405.432863-1-hli@netflix.com/Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Eric Woudstra says: ==================== rtl8226b/8221b add C45 instances and SerDes switching Based on the comments in [PATCH net-next] "Realtek RTL822x PHY rework to c45 and SerDes interface switching" Adds SerDes switching interface between 2500base-x and sgmii for rtl8221b and rtl8226b. Add get_rate_matching() for rtl8226b and rtl8221b, reading the serdes mode from phy. Driver instances are added for rtl8226b and rtl8221b for Clause 45 access only. The existing code is not touched, they use newly added functions. They also use the same rtl822xb_config_init() and rtl822xb_get_rate_matching() as these functions also can be used for direct Clause 45 access. Also Adds definition of MMC 31 registers, which cannot be used through C45-over-C22, only when phydev->is_c45 is set. Change rtlgen_get_speed() so the register value is passed as argument. Using Clause 45 access, this value is retrieved differently. Rename it to rtlgen_decode_speed() and add a call to it in rtl822x_c45_read_status(). Add rtl822x_c45_get_features() to set supported port for rtl8221b. Then 1 quirk is added for sfp modules known to have a rtl8221b behind RollBall, Clause 45 only, protocol. Changed in PATCH v4: * Changed switch to if statement in rtl822xb_get_rate_matching() * Removed setting ETHTOOL_LINK_MODE_MII_BIT in rtl822x_c45_get_features() Changed in PATCH v3: * Only apply to rtl8221b and rtl8226b phy's * Set phydev->rate_matching in .config_init() * Removed OEM SFP fixup for now, as there are modules with the same vendor name/PN, but with different PHY's. We found rtl8221b, but also the ty8821, which is not yet supported. Changed in PATCH v2: * Set author to Marek for the commit of the new C45 instances * Separate commit for setting supported ports * Renamed rtlgen_get_speed to rtlgen_decode_speed * Always fill in possible interfaces * Renamed sfp_fixup_oem_2_5g to sfp_fixup_oem_2_5gbaset * Only update phydev->interface when link is up Alexander Couzens (1): net: phy: realtek: configure SerDes mode for rtl822xb PHYs Eric Woudstra (3): net: phy: realtek: add get_rate_matching() for rtl822xb PHYs net: phy: realtek: Change rtlgen_get_speed() to rtlgen_decode_speed() net: phy: realtek: add rtl822x_c45_get_features() to set supported port ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Marek Behún authored
Add quirk for another RollBall copper transceiver: Turris RTSFP-2.5G, containing 2.5g capable RTL8221B PHY. Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Eric Woudstra <ericwouds@gmail.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Woudstra authored
Sets ETHTOOL_LINK_MODE_TP_BIT in phydev->supported. Signed-off-by: Eric Woudstra <ericwouds@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Woudstra authored
The value of the register to determine the speed, is retrieved differently when using Clause 45 only. To use the rtlgen_get_speed() function in this case, pass the value of the register as argument to rtlgen_get_speed(). The function would then always return 0, so change it to void. A better name for this function now is rtlgen_decode_speed(). Replace a call to genphy_read_status() followed by rtlgen_get_speed() with a call to rtlgen_read_status() in rtl822x_read_status(). Add reading speed to rtl822x_c45_read_status(). Signed-off-by: Eric Woudstra <ericwouds@gmail.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Marek Behún authored
Collected from several commits in [PATCH net-next] "Realtek RTL822x PHY rework to c45 and SerDes interface switching" The instances are used by Clause 45 only accessible PHY's on several sfp modules, which are using RollBall protocol. Signed-off-by: Marek Behún <kabel@kernel.org> [ Added matching functions to differentiate C45 instances ] Signed-off-by: Eric Woudstra <ericwouds@gmail.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Woudstra authored
Uses vendor register to determine if SerDes is setup in rate-matching mode. Rate-matching only supported when SerDes is set to 2500base-x. Signed-off-by: Eric Woudstra <ericwouds@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Couzens authored
The rtl8221b and rtl8226b series support switching SerDes mode between 2500base-x and sgmii based on the negotiated copper speed. Configure this switching mode according to SerDes modes supported by host. There is an additional datasheet for RTL8226B/RTL8221B called "SERDES MODE SETTING FLOW APPLICATION NOTE" where a sequence is described to setup interface and rate adapter mode. However, there is no documentation about the meaning of registers and bits, it's literally just magic numbers and pseudo-code. Signed-off-by: Alexander Couzens <lynxis@fe80.eu> [ refactored, dropped HiSGMII mode and changed commit message ] Signed-off-by: Marek Behún <kabel@kernel.org> [ changed rtl822x_update_interface() to use vendor register ] [ always fill in possible interfaces ] [ only apply to rtl8221b and rtl8226b phy's ] [ set phydev->rate_matching in .config_init() ] Signed-off-by: Eric Woudstra <ericwouds@gmail.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: should come before them, without any blank lines. As the Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Russell King says: ==================== net: dsa: allow phylink_mac_ops in DSA drivers This series showcases my idea of moving the phylink_mac_ops into DSA drivers, using mv88e6xxx as an example. Since I'm only changing one driver, providing the mac_ops has to be optional and the existing shims need to be kept for unconverted drivers. The first patch introduces a new helper that converts from the phylink_config structure that phylink uses to communicate with MAC drivers to the dsa_port structure. From this, DSA drivers can get the dsa_switch structure and thus their implementation specific data structure, and they can also retrieve the port index. The second patch adds the support to the core DSA layer to allow DSA drivers to provide phylink_mac_ops. The third patch converts mv88e6xxx to use this. I initially made this change after adding yet more phylink to DSA driver shims for my work with phylink-based EEE support, and decided that it was getting silly to keep implementing more and more shims. There are cases where shims don't work well - we had already tripped over a case a few years ago when the phylink mac_select_pcs operation was introduced. Phylink tested for the presence of this in the ops structure, but with DSA shims, this doesn't necessarily mean that the sub-driver supports this method. The only way to find that out is to call the method with dummy values and check the return code. The same thing was partly true when adding EEE support, and I ended up with this in phylink to determine whether the MAC supported EEE: +static bool phylink_mac_supports_eee(struct phylink *pl) +{ + return pl->mac_ops->mac_disable_tx_lpi && + pl->mac_ops->mac_enable_tx_lpi && + pl->config->lpi_capabilities; +} because merely testing for the presence of the operations is insufficient when shims are involved - and it wasn't possible to call these functions in the way that mac_select_pcs could be called. So, I think it's time to get away from this shimming model and instead have drivers directly interface to the various subsystems. This converts mv88e6xxx. I have similar patches for other DSA drivers that will be sent once this has been reviewed. ==================== Link: https://lore.kernel.org/r/ZhbrbM+d5UfgafGp@shell.armlinux.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Russell King (Oracle) authored
Convert mv88e6xxx to provide its own phylink MAC operations, thus avoiding the shim layer in DSA's port.c Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/E1rudqK-006K9N-HY@rmk-PC.armlinux.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Russell King (Oracle) authored
Rather than having a shim for each and every phylink MAC operation, allow DSA switch drivers to provide their own ops structure. When a DSA driver provides the phylink MAC operations, the shimmed ops must not be provided, so fail an attempt to register a switch with both the phylink_mac_ops in struct dsa_switch and the phylink_mac_* operations populated in dsa_switch_ops populated. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/E1rudqF-006K9H-Cc@rmk-PC.armlinux.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Russell King (Oracle) authored
We convert from a phylink_config struct to a dsa_port struct in many places, let's provide a helper for this. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/E1rudqA-006K9B-85@rmk-PC.armlinux.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Colin Ian King authored
The variable decrypted is being assigned a value that is never read, the control of flow after the assignment is via an return path and decrypted is not referenced in this path. The assignment is redundant and can be removed. Cleans up clang scan warning: net/tls/tls_sw.c:2150:4: warning: Value stored to 'decrypted' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20240410144136.289030-1-colin.i.king@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Guillaume Nault authored
RTO_ONLINK was a flag used in ->flowi4_tos that allowed to alter the scope of an IPv4 route lookup. Setting this flag was equivalent to specifying RT_SCOPE_LINK in ->flowi4_scope. With commit ec20b283 ("ipv4: Set scope explicitly in ip_route_output()."), the last users of RTO_ONLINK have been removed. Therefore, we can now drop the code that checked this bit and stop modifying ->flowi4_scope in ip_route_output_key_hash(). Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/57de760565cab55df7b129f523530ac6475865b2.1712754146.git.gnault@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jon Maloy authored
When reading received messages from a socket with MSG_PEEK, we may want to read the contents with an offset, like we can do with pread/preadv() when reading files. Currently, it is not possible to do that. In this commit, we add support for the SO_PEEK_OFF socket option for TCP, in a similar way it is done for Unix Domain sockets. In the iperf3 log examples shown below, we can observe a throughput improvement of 15-20 % in the direction host->namespace when using the protocol splicer 'pasta' (https://passt.top). This is a consistent result. pasta(1) and passt(1) implement user-mode networking for network namespaces (containers) and virtual machines by means of a translation layer between Layer-2 network interface and native Layer-4 sockets (TCP, UDP, ICMP/ICMPv6 echo). Received, pending TCP data to the container/guest is kept in kernel buffers until acknowledged, so the tool routinely needs to fetch new data from socket, skipping data that was already sent. At the moment this is implemented using a dummy buffer passed to recvmsg(). With this change, we don't need a dummy buffer and the related buffer copy (copy_to_user()) anymore. passt and pasta are supported in KubeVirt and libvirt/qemu. jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f SO_PEEK_OFF not supported by kernel. jmaloy@freyr:~/passt# iperf3 -s ----------------------------------------------------------- Server listening on 5201 (test #1) ----------------------------------------------------------- Accepted connection from 192.168.122.1, port 44822 [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832 [ ID] Interval Transfer Bitrate [ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec [ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec [ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec [ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec [ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec [ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec [ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec [ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec [ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec [ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec [ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate [ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver ----------------------------------------------------------- Server listening on 5201 (test #2) ----------------------------------------------------------- ^Ciperf3: interrupt - the server has terminated jmaloy@freyr:~/passt# logout [ perf record: Woken up 23 times to write data ] [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ] jmaloy@freyr:~/passt$ jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f SO_PEEK_OFF supported by kernel. jmaloy@freyr:~/passt# iperf3 -s ----------------------------------------------------------- Server listening on 5201 (test #1) ----------------------------------------------------------- Accepted connection from 192.168.122.1, port 52084 [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098 [ ID] Interval Transfer Bitrate [ 5] 0.00-1.00 sec 1.32 GBytes 11.3 Gbits/sec [ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec [ 5] 2.00-3.00 sec 1.26 GBytes 10.8 Gbits/sec [ 5] 3.00-4.00 sec 1.36 GBytes 11.7 Gbits/sec [ 5] 4.00-5.00 sec 1.33 GBytes 11.4 Gbits/sec [ 5] 5.00-6.00 sec 1.21 GBytes 10.4 Gbits/sec [ 5] 6.00-7.00 sec 1.31 GBytes 11.2 Gbits/sec [ 5] 7.00-8.00 sec 1.25 GBytes 10.7 Gbits/sec [ 5] 8.00-9.00 sec 1.33 GBytes 11.5 Gbits/sec [ 5] 9.00-10.00 sec 1.24 GBytes 10.7 Gbits/sec [ 5] 10.00-10.04 sec 56.0 MBytes 12.1 Gbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate [ 5] 0.00-10.04 sec 12.9 GBytes 11.0 Gbits/sec receiver ----------------------------------------------------------- Server listening on 5201 (test #2) ----------------------------------------------------------- ^Ciperf3: interrupt - the server has terminated logout [ perf record: Woken up 20 times to write data ] [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ] jmaloy@freyr:~/passt$ The perf record confirms this result. Below, we can observe that the CPU spends significantly less time in the function ____sys_recvmsg() when we have offset support. Without offset support: ---------------------- jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \ -p ____sys_recvmsg -x --stdio -i perf.data | head -1 46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg With offset support: ---------------------- jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \ -p ____sys_recvmsg -x --stdio -i perf.data | head -1 28.12% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg Suggested-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240409152805.913891-1-jmaloy@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Breno Leitao authored
Commit 3e2f544d ("net: get stats64 if device if driver is configured") moved the callback to dev_get_tstats64() to net core, so, unless the driver is doing some custom stats collection, it does not need to set .ndo_get_stats64. Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64 function pointer. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240409133307.2058099-2-leitao@debian.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Breno Leitao authored
With commit 34d21de9 ("net: Move {l,t,d}stats allocation to core and convert veth & vrf"), stats allocation could be done on net core instead of in this driver. With this new approach, the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc). This is core responsibility now. Remove the allocation in the qmi_wwan driver and leverage the network core allocation instead. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240409133307.2058099-1-leitao@debian.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
- Use for_each_netdev_dump() to no longer rely on net->dev_index_head hash table. - No longer care of net->dev_base_seq - Fix return value at the end of a dump, so that NLMSG_DONE can be appended to current skb, saving one recvmsg() system call. - No longer grab RTNL, RCU protection is enough, afer adding one READ_ONCE(mdev->input_enabled) in mpls_netconf_fill_devconf() Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240410111951.2673193-1-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Asbjørn Sloth Tønnesen authored
include/net/flow_offload.h:351: warning: No description found for return value of 'flow_offload_has_one_action' Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com> Link: https://lore.kernel.org/r/20240410114718.15145-1-ast@fiberby.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Carolina Jubran authored
Q counters are device-level counters that track specific events, among which are out_of_buffer events. These events occur when packets are dropped due to a lack of receive buffer in the RX queue. Expose the total number of out_of_buffer events on the VFs/SFs to their respective representor, using the "ip stats group link" under the name of "rx_missed". The "rx_missed" equals the sum of all Q counters out_of_buffer values allocated on the VFs/SFs. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240410214154.250583-1-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Mina Almasry says: ==================== Minor cleanups to skb frag ref/unref This series is largely motivated by a recent discussion where there was some confusion on how to properly ref/unref pp pages vs non pp pages: https://lore.kernel.org/netdev/CAHS8izOoO-EovwMwAm9tLYetwikNPxC0FKyVGu1TPJWSz4bGoA@mail.gmail.com/T/#t There is some subtely there because pp uses page->pp_ref_count for refcounting, while non-pp uses get_page()/put_page() for ref counting. Getting the refcounting pairs wrong can lead to kernel crash. Additionally currently it may not be obvious to skb users unaware of page pool internals how to properly acquire a ref on a pp frag. It requires checking of skb->pp_recycle & is_pp_page() to make the correct calls and may require some handling at the call site aware of arguable pp internals. This series is a minor refactor with a couple of goals: 1. skb users should be able to ref/unref a frag using [__]skb_frag_[un]ref() functions without needing to understand pp concepts and pp_ref_count vs get/put_page() differences. 2. reference counting functions should have a mirror opposite. I.e. there should be a foo_unref() to every foo_ref() with a mirror opposite implementation (as much as possible). This is RFC to collect feedback if this change is desirable, but also so that I don't race with the fix for the issue Dragos is seeing for his crash. https://lore.kernel.org/lkml/CAHS8izN436pn3SndrzsCyhmqvJHLyxgCeDpWXA4r1ANt3RCDLQ@mail.gmail.com/T/ ==================== Link: https://lore.kernel.org/r/20240410190505.1225848-1-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Mina Almasry authored
Refactor some of the skb frag ref/unref helpers for improved clarity. Implement napi_pp_get_page() to be the mirror counterpart of napi_pp_put_page(). Implement skb_page_ref() to be the mirror of skb_page_unref(). Improve __skb_frag_ref() to become a mirror counterpart of __skb_frag_unref(). Previously unref could handle pp & non-pp pages, while the ref could only handle non-pp pages. Now both the ref & unref helpers can correctly handle both pp & non-pp pages. Now that __skb_frag_ref() can handle both pp & non-pp pages, remove skb_pp_frag_ref(), and use __skb_frag_ref() instead. This lets us remove pp specific handling from skb_try_coalesce. Additionally, since __skb_frag_ref() can now handle both pp & non-pp pages, a latent issue in skb_shift() should now be fixed. Previously this function would do a non-pp ref & pp unref on potential pp frags (fragfrom). After this patch, skb_shift() should correctly do a pp ref/unref on pp frags. Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240410190505.1225848-3-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Mina Almasry authored
Add a new header, linux/skbuff_ref.h, which contains all the skb_*_ref() helpers. Many of the consumers of skbuff.h do not actually use any of the skb ref helpers, and we can speed up compilation a bit by minimizing this header file. Additionally in the later patch in the series we add page_pool support to skb_frag_ref(), which requires some page_pool dependencies. We can now add these dependencies to skbuff_ref.h instead of a very ubiquitous skbuff.h Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://lore.kernel.org/r/20240410190505.1225848-2-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 11 Apr, 2024 15 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski authored
Cross-merge networking fixes after downstream PR. Conflicts: net/unix/garbage.c 47d8ac01 ("af_unix: Fix garbage collector racing against connect()") 4090fa37 ("af_unix: Replace garbage collection algorithm.") Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c faa12ca2 ("bnxt_en: Reset PTP tx_avail after possible firmware reset") b3d0083c ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()") drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c 7ac10c7d ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()") 194fad5b ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions") drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 958f56e4 ("net/mlx5e: Un-expose functions in en.h") 49e6c938 ("net/mlx5e: RSS, Block XOR hash with over 128 channels") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds authored
Pull networking fixes from Paolo Abeni: "Including fixes from bluetooth. Current release - new code bugs: - netfilter: complete validation of user input - mlx5: disallow SRIOV switchdev mode when in multi-PF netdev Previous releases - regressions: - core: fix u64_stats_init() for lockdep when used repeatedly in one file - ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr - bluetooth: fix memory leak in hci_req_sync_complete() - batman-adv: avoid infinite loop trying to resize local TT - drv: geneve: fix header validation in geneve[6]_xmit_skb - drv: bnxt_en: fix possible memory leak in bnxt_rdma_aux_device_init() - drv: mlx5: offset comp irq index in name by one - drv: ena: avoid double-free clearing stale tx_info->xdpf value - drv: pds_core: fix pdsc_check_pci_health deadlock Previous releases - always broken: - xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING - bluetooth: fix setsockopt not validating user input - af_unix: clear stale u->oob_skb. - nfc: llcp: fix nfc_llcp_setsockopt() unsafe copies - drv: virtio_net: fix guest hangup on invalid RSS update - drv: mlx5e: Fix mlx5e_priv_init() cleanup flow - dsa: mt7530: trap link-local frames regardless of ST Port State" * tag 'net-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits) net: ena: Set tx_info->xdpf value to NULL net: ena: Fix incorrect descriptor free behavior net: ena: Wrong missing IO completions check order net: ena: Fix potential sign extension issue af_unix: Fix garbage collector racing against connect() net: dsa: mt7530: trap link-local frames regardless of ST Port State Revert "s390/ism: fix receive message buffer allocation" net: sparx5: fix wrong config being used when reconfiguring PCS net/mlx5: fix possible stack overflows net/mlx5: Disallow SRIOV switchdev mode when in multi-PF netdev net/mlx5e: RSS, Block XOR hash with over 128 channels net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit net/mlx5e: HTB, Fix inconsistencies with QoS SQs number net/mlx5e: Fix mlx5e_priv_init() cleanup flow net/mlx5e: RSS, Block changing channels number when RXFH is configured net/mlx5: Correctly compare pkt reformat ids net/mlx5: Properly link new fs rules into the tree net/mlx5: offset comp irq index in name by one net/mlx5: Register devlink first under devlink lock net/mlx5: E-switch, store eswitch pointer before registering devlink_param ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds authored
Pull SCSI fixes from James Bottomley: "The most important fix is the sg one because the regression it fixes (spurious warning and use after final put) is already backported to stable. The next biggest impact is the target fix for wrong credentials used to load a module because it's affecting new kernels installed on selinux based distributions. The other three fixes are an obvious off by one and SATA protocol issues" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: qla2xxx: Fix off by one in qla_edif_app_getstats() scsi: hisi_sas: Modify the deadline for ata_wait_after_reset() scsi: hisi_sas: Handle the NCQ error returned by D2H frame scsi: target: Fix SELinux error when systemd-modules loads the target module scsi: sg: Avoid race in error handling & drop bogus warn
-
Linus Torvalds authored
Merge tag 'loongarch-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson Pull LoongArch fixes from Huacai Chen: - make {virt, phys, page, pfn} translation work with KFENCE for LoongArch (otherwise NVMe and virtio-blk cannot work with KFENCE enabled) - update dts files for Loongson-2K series to make devices work correctly - fix a build error * tag 'loongarch-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: LoongArch: Include linux/sizes.h in addrspace.h to prevent build errors LoongArch: Update dts for Loongson-2K2000 to support GMAC/GNET LoongArch: Update dts for Loongson-2K2000 to support PCI-MSI LoongArch: Update dts for Loongson-2K2000 to support ISA/LPC LoongArch: Update dts for Loongson-2K1000 to support ISA/LPC LoongArch: Make virt_addr_valid()/__virt_addr_valid() work with KFENCE LoongArch: Make {virt, phys, page, pfn} translation work with KFENCE mm: Move lowmem_page_address() a little later
-
https://evilpiepirate.org/git/bcachefsLinus Torvalds authored
Pull more bcachefs fixes from Kent Overstreet: "Notable user impacting bugs - On multi device filesystems, recovery was looping in btree_trans_too_many_iters(). This checks if a transaction has touched too many btree paths (because of iteration over many keys), and isuses a restart to drop unneeded paths. But it's now possible for some paths to exceed the previous limit without iteration in the interior btree update path, since the transaction commit will do alloc updates for every old and new btree node, and during journal replay we don't use the btree write buffer for locking reasons and thus those updates use btree paths when they wouldn't normally. - Fix a corner case in rebalance when moving extents on a durability=0 device. This wouldn't be hit when a device was formatted with durability=0 since in that case we'll only use it as a write through cache (only cached extents will live on it), but durability can now be changed on an existing device. - bch2_get_acl() could rarely forget to handle a transaction restart; this manifested as the occasional missing acl that came back after dropping caches. - Fix a major performance regression on high iops multithreaded write workloads (only since 6.9-rc1); a previous fix for a deadlock in the interior btree update path to check the journal watermark introduced a dependency on the state of btree write buffer flushing that we didn't want. - Assorted other repair paths and recovery fixes" * tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs: (25 commits) bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter() bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail() bcachefs: Fix a race in btree_update_nodes_written() bcachefs: btree_node_scan: Respect member.data_allowed bcachefs: Don't scan for btree nodes when we can reconstruct bcachefs: Fix check_topology() when using node scan bcachefs: fix eytzinger0_find_gt() bcachefs: fix bch2_get_acl() transaction restart handling bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take() bcachefs: Rename struct field swap to prevent macro naming collision MAINTAINERS: Add entry for bcachefs documentation Documentation: filesystems: Add bcachefs toctree bcachefs: JOURNAL_SPACE_LOW bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems bcachefs: fix rand_delete unit test bcachefs: fix ! vs ~ typo in __clear_bit_le64() bcachefs: Fix rebalance from durability=0 device bcachefs: Print shutdown journal sequence number ...
-
Linus Torvalds authored
Merge tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux Pull chrome platform fix from Tzung-Bi Shih: "Fix a NULL pointer dereference" * tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux: platform/chrome: cros_ec_uart: properly fix race condition
-
Jakub Kicinski authored
Matthieu Baerts says: ==================== mptcp: add last time fields in mptcp_info These patches from Geliang add support for the "last time" field in MPTCP Info, and verify that the counters look valid. Patch 1 adds these counters: last_data_sent, last_data_recv and last_ack_recv. They are available in the MPTCP Info, so exposed via getsockopt(MPTCP_INFO) and the Netlink Diag interface. Patch 2 adds a test in diag.sh MPTCP selftest, to check that the counters have moved by at least 250ms, after having waited twice that time. v1: https://lore.kernel.org/r/20240405-upstream-net-next-20240405-mptcp-last-time-info-v1-0-52dc49453649@kernel.org ==================== Link: https://lore.kernel.org/r/20240410-upstream-net-next-20240405-mptcp-last-time-info-v2-0-f95bd6b33e51@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Geliang Tang authored
This patch adds a new helper chk_msk_info() to show the counters in mptcp_info of the given info, and check that the timestamps move forward. Use it to show newly added last_data_sent, last_data_recv and last_ack_recv in mptcp_info in chk_last_time_info(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240410-upstream-net-next-20240405-mptcp-last-time-info-v2-2-f95bd6b33e51@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Geliang Tang authored
This patch adds "last time" fields last_data_sent, last_data_recv and last_ack_recv in struct mptcp_sock to record the last time data_sent, data_recv and ack_recv happened. They all are initialized as tcp_jiffies32 in __mptcp_init_sock(), and updated as tcp_jiffies32 too when data is sent in __subflow_push_pending(), data is received in __mptcp_move_skbs_from_subflow(), and ack is received in ack_update_msk(). Similar to tcpi_last_data_sent, tcpi_last_data_recv and tcpi_last_ack_recv exposed with TCP, this patch exposes the last time "an action happened" for MPTCP in mptcp_info, named mptcpi_last_data_sent, mptcpi_last_data_recv and mptcpi_last_ack_recv, calculated in mptcp_diag_fill_info() as the time deltas between now and the newly added last time fields in mptcp_sock. Since msk->last_ack_recv needs to be protected by mptcp_data_lock/unlock, and lock_sock_fast can sleep and be quite slow, move the entire mptcp_data_lock/unlock block after the lock/unlock_sock_fast block. Then mptcpi_last_data_sent and mptcpi_last_data_recv are set in lock/unlock_sock_fast block, while mptcpi_last_ack_recv is set in mptcp_data_lock/unlock block, which is protected by a spinlock and should not block for too long. Also add three reserved bytes in struct mptcp_info not to have holes in this structure exposed to userspace. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/446Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Mat Martineau <martineau@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240410-upstream-net-next-20240405-mptcp-last-time-info-v2-1-f95bd6b33e51@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.gitJakub Kicinski authored
Erick Archer says: ==================== mana: Add flex array to struct mana_cfg_rx_steer_req_v2 (part) The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of trailing elements. Specifically, it uses a "mana_handle_t" array. So, use the preferred way in the kernel declaring a flexible array [1]. At the same time, prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). Also, avoid the open-coded arithmetic in the memory allocator functions [2] using the "struct_size" macro. Moreover, use the "offsetof" helper to get the indirect table offset instead of the "sizeof" operator and avoid the open-coded arithmetic in pointers using the new flex member. This new structure member also allow us to remove the "req_indir_tab" variable since it is no longer needed. Now, it is also possible to use the "flex_array_size" helper to compute the size of these trailing elements in the "memcpy" function. Specifically, the first commit adds the flex member and the patches 2 and 3 refactor the consumers of the "struct mana_cfg_rx_steer_req_v2". This code was detected with the help of Coccinelle, and audited and modified manually. The Coccinelle script used to detect this code pattern is the following: virtual report @rule1@ type t1; type t2; identifier i0; identifier i1; identifier i2; identifier ALLOC =~ "kmalloc|kzalloc|kmalloc_node|kzalloc_node|vmalloc|vzalloc|kvmalloc|kvzalloc"; position p1; @@ i0 = sizeof(t1) + sizeof(t2) * i1; ... i2 = ALLOC@p1(..., i0, ...); @script:python depends on report@ p1 << rule1.p1; @@ msg = "WARNING: verify allocation on line %s" % (p1[0].line) coccilib.report.print_report(p1[0],msg) Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays [1] Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2] v1: https://lore.kernel.org/linux-hardening/AS8PR02MB7237974EF1B9BAFA618166C38B382@AS8PR02MB7237.eurprd02.prod.outlook.com/ v2: https://lore.kernel.org/linux-hardening/AS8PR02MB723729C5A63F24C312FC9CD18B3F2@AS8PR02MB7237.eurprd02.prod.outlook.com/ ==================== Link: https://lore.kernel.org/r/AS8PR02MB72374BD1B23728F2E3C3B1A18B022@AS8PR02MB7237.eurprd02.prod.outlook.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Erick Archer authored
This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows [1][2]. As the "req" variable is a pointer to "struct mana_cfg_rx_steer_req_v2" and this structure ends in a flexible array: struct mana_cfg_rx_steer_req_v2 { [...] mana_handle_t indir_tab[] __counted_by(num_indir_entries); }; the preferred way in the kernel is to use the struct_size() helper to do the arithmetic instead of the calculation "size + size * count" in the kzalloc() function. Moreover, use the "offsetof" helper to get the indirect table offset instead of the "sizeof" operator and avoid the open-coded arithmetic in pointers using the new flex member. This new structure member also allow us to remove the "req_indir_tab" variable since it is no longer needed. Now, it is also possible to use the "flex_array_size" helper to compute the size of these trailing elements in the "memcpy" function. This way, the code is more readable and safer. This code was detected with the help of Coccinelle, and audited and modified manually. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1] Link: https://github.com/KSPP/linux/issues/160 [2] Signed-off-by: Erick Archer <erick.archer@outlook.com> Link: https://lore.kernel.org/r/AS8PR02MB7237A21355C86EC0DCC0D83B8B022@AS8PR02MB7237.eurprd02.prod.outlook.comReviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
-
Erick Archer authored
This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows [1][2]. As the "req" variable is a pointer to "struct mana_cfg_rx_steer_req_v2" and this structure ends in a flexible array: struct mana_cfg_rx_steer_req_v2 { [...] mana_handle_t indir_tab[] __counted_by(num_indir_entries); }; the preferred way in the kernel is to use the struct_size() helper to do the arithmetic instead of the calculation "size + size * count" in the kzalloc() function. Moreover, use the "offsetof" helper to get the indirect table offset instead of the "sizeof" operator and avoid the open-coded arithmetic in pointers using the new flex member. This new structure member also allow us to remove the "req_indir_tab" variable since it is no longer needed. This way, the code is more readable and safer. This code was detected with the help of Coccinelle, and audited and modified manually. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1] Link: https://github.com/KSPP/linux/issues/160 [2] Signed-off-by: Erick Archer <erick.archer@outlook.com> Link: https://lore.kernel.org/r/AS8PR02MB72375EB06EE1A84A67BE722E8B022@AS8PR02MB7237.eurprd02.prod.outlook.comReviewed-by: Long Li <longli@microsoft.com> Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
-
Erick Archer authored
The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of trailing elements. Specifically, it uses a "mana_handle_t" array. So, use the preferred way in the kernel declaring a flexible array [1]. At the same time, prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). This is a previous step to refactor the two consumers of this structure. drivers/infiniband/hw/mana/qp.c drivers/net/ethernet/microsoft/mana/mana_en.c The ultimate goal is to avoid the open-coded arithmetic in the memory allocator functions [2] using the "struct_size" macro. Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays [1] Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2] Signed-off-by: Erick Archer <erick.archer@outlook.com> Link: https://lore.kernel.org/r/AS8PR02MB7237E2900247571C9CB84C678B022@AS8PR02MB7237.eurprd02.prod.outlook.comReviewed-by: Long Li <longli@microsoft.com> Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
-
Paolo Abeni authored
David Arinzon says: ==================== ENA driver bug fixes From: David Arinzon <darinzon@amazon.com> This patchset contains multiple bug fixes for the ENA driver. ==================== Link: https://lore.kernel.org/r/20240410091358.16289-1-darinzon@amazon.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
David Arinzon authored
The patch mentioned in the `Fixes` tag removed the explicit assignment of tx_info->xdpf to NULL with the justification that there's no need to set tx_info->xdpf to NULL and tx_info->num_of_bufs to 0 in case of a mapping error. Both values won't be used once the mapping function returns an error, and their values would be overridden by the next transmitted packet. While both values do indeed get overridden in the next transmission call, the value of tx_info->xdpf is also used to check whether a TX descriptor's transmission has been completed (i.e. a completion for it was polled). An example scenario: 1. Mapping failed, tx_info->xdpf wasn't set to NULL 2. A VF reset occurred leading to IO resource destruction and a call to ena_free_tx_bufs() function 3. Although the descriptor whose mapping failed was freed by the transmission function, it still passes the check if (!tx_info->skb) (skb and xdp_frame are in a union) 4. The xdp_frame associated with the descriptor is freed twice This patch returns the assignment of NULL to tx_info->xdpf to make the cleaning function knows that the descriptor is already freed. Fixes: 504fd6a5 ("net: ena: fix DMA mapping function issues in XDP") Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-