Commits · 53e9426204a0dbc2d651f1088e0f7eaf19c57793 · Kirill Smelkov / linux

13 Apr, 2024 14 commits

selftests: netfilter: nft_flowtable.sh: move test to lib.sh infra · 53e94262

Florian Westphal authored Apr 12, 2024

Use socat, the different nc implementations have too much variance wrt.
supported options.

Avoid sleeping until listener is up, use busywait helper for this,
this also greatly reduces test duration.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-15-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

53e94262

selftests: netfilter: nft_fib.sh: move to lib.sh infra · 6bc0709b

Florian Westphal authored Apr 12, 2024

Also lower ping interval, wait times (helpers get called several times)
and set nodad for ipv6 addresses: 20s down to 4s.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-14-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6bc0709b

selftests: netfilter: nft_conntrack_helper.sh: test to lib.sh infra · fa03bb7c

Florian Westphal authored Apr 12, 2024

prefer socat over nc, nc has too many incompatible versions around.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-13-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

fa03bb7c

selftests: netfilter: nf_nat_edemux.sh: move to lib.sh infra · f51fe025

Florian Westphal authored Apr 12, 2024

While at it, use checktool helper.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-12-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f51fe025

selftests: netfilter: ipvs.sh: move to lib.sh infra · 87ce7d79

Florian Westphal authored Apr 12, 2024

The setup_ns helper makes the netns names random, so replace nsX with $nsX
everywhere.

Replace nc with socat, otherwise script fails on my system due to
incompatible nc versions ("nc: cannot use -p and -l").
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-11-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

87ce7d79

selftests: netfilter: place checktool helper in lib.sh · 10e2ed3f

Florian Westphal authored Apr 12, 2024

... so it doesn't have to be repeated everywhere.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-10-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

10e2ed3f

selftests: netfilter: conntrack_ipip_mtu.sh" move to lib.sh infra · 0413156e

Florian Westphal authored Apr 12, 2024

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-9-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

0413156e

selftests: netfilter: conntrack_vrf.sh: move to lib.sh infra · 954398b4

Florian Westphal authored Apr 12, 2024

swap test for "ip" with "conntrack", former is already accounted for
via setup_ns helper. Also switch to bash.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-8-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

954398b4

selftests: netfilter: conntrack_sctp_collision.sh: move to lib.sh infra · 9785517a

Florian Westphal authored Apr 12, 2024

While at it, address warnings generated by shellcheck and fix following
minor issues:

 - some distros place netem in 'extra' modules package, so add a skip check for netem-attach
   failure.
 - tc prints a warning for the 100mbit class:
   "Warning: sch_htb: quantum of class 10001 is big. Consider r2q change."
   Silence this by increasing the divisor.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-7-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9785517a

selftests: netfilter: conntrack_tcp_unreplied.sh: move to lib.sh infra · 6f864d39

Florian Westphal authored Apr 12, 2024

Replace nc with socat. Too many different implementations of nc
are around with incompatible options ("nc: cannot use -p and -l").
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-6-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6f864d39

selftests: netfilter: conntrack_icmp_related.sh: move to lib.sh infra · 96f6c273

Florian Westphal authored Apr 12, 2024

Only relevant change is that netns names have random suffix names,
i.e. its safe to run this in parallel with other tests.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-5-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

96f6c273

selftests: netfilter: br_netfilter.sh: move to lib.sh infra · 1286e106

Florian Westphal authored Apr 12, 2024

Also, fix two issues reported by Pablo Neira:
1. Must modprobe br_netfilter in case its not loaded,
   else sysctl cannot be set.
2. ping for netns4 fails if rp_filter is enabled in bridge netns,
   so set all and default to 0.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-4-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1286e106

selftests: netfilter: bridge_brouter.sh: move to lib.sh infra · 94831b13

Florian Westphal authored Apr 12, 2024

Doing so gets us dynamically generated netns names.

Also:
* do not assume rp_filter is disabled, if its on script failed
* reduce timeout (-W) for "expected to fail" ping commands
* don't print PASS line for basic sanity ping
* shellcheck cleanups
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-3-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

94831b13

selftests: netfilter: move to net subdir · 3f189349

Florian Westphal authored Apr 12, 2024

.. so this can start re-using existing lib.sh infra in next patches.

Several of these scripts will not work, e.g. because they assume
rp_filter is disabled, or reliance on a particular version/flavor
of "netcat" tool.

Add config settings for them.

nft_trans_stress.sh script is removed, it also exists in the nftables
userspace selftests.  I do not see a reason to keep two versions in
different repositories/projects.

The settings file is removed for now:

It was used to increase the timeout to avoid slow scripts from getting
zapped by the 45s timeout, but some of the slow scripts can be sped up.
Re-add it later for scripts that cannot be sped up easily.

Update MAINTAINERS to reflect that future updates to netfilter
scripts should go through netfilter-devel@.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240411233624.8129-2-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3f189349

12 Apr, 2024 26 commits

Merge branch 'nfp-minor-improvements' · 982a73c7

David S. Miller authored Apr 12, 2024

Louis Peens says:

====================
nfp: series of minor driver improvements

This short series bundles now only includes a small update to add a
board part number to devlink. Previously some dim patches also formed
part of this series, these were dropped in v5.

Patch1: Add new define for devlink string "board.part_number"
Patch2: Make use of this field in the nfp driver

Changes since V4:
- Dropped the dim patches, as there is a more significant rework in
  progress to make it more flexible, as mentioned in the V4 review:
  https://lore.kernel.org/all/1712547870-112976-2-git-send-email-hengqi@linux.alibaba.com/
- Updated the devlink description of 'board.part_number'

Changes since V3:
- Fixed: Documentation/networking/devlink/devlink-info.rst:150:
    WARNING: Title underline too short.

Changes since V2:
- After some discussion on the previous series it was agreed that only
  the "board.part_number" field makes sense in the common code. The
  "board.model" field which was moved to devlink common code in V1 is
  now kept in the driver. The field is specific to the nfp driver,
  exposing the codename of the board.
- In summary, add "board.part_number" to devlink, and populate it
  in the the nfp driver.

Changes since V1:
- Move nfp local defines to devlink common code as it is quite generic.
- Add new 'dim' profile instead of using driver local overrides, as this
  allows use of the 'dim' helpers.
- This expanded 2 patches to 4, as the common code changes are split
  into seperate patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

982a73c7

nfp: update devlink device info output · 8910f93b

Fei Qin authored Apr 10, 2024

Newer NIC will introduce a new part number, now add it
into devlink device info.

This patch also updates the information of "board.id" in
nfp.rst to match the devlink-info.rst.
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8910f93b

devlink: add a new info version tag · 3bb946c9

Fei Qin authored Apr 10, 2024

Add definition and documentation for the new generic
info "board.part_number".

The new one is for part number specific use, and board.id
is modified to match the documentation in devlink-info.
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3bb946c9

tcp: increase the default TCP scaling ratio · 697a6c8c

Hechao Li authored Apr 09, 2024

After commit dfa2f048 ("tcp: get rid of sysctl_tcp_adv_win_scale"),
we noticed an application-level timeout due to reduced throughput.

Before the commit, for a client that sets SO_RCVBUF to 65k, it takes
around 22 seconds to transfer 10M data. After the commit, it takes 40
seconds. Because our application has a 30-second timeout, this
regression broke the application.

The reason that it takes longer to transfer data is that
tp->scaling_ratio is initialized to a value that results in ~0.25 of
rcvbuf. In our case, SO_RCVBUF is set to 65536 by the application, which
translates to 2 * 65536 = 131,072 bytes in rcvbuf and hence a ~28k
initial receive window.

Later, even though the scaling_ratio is updated to a more accurate
skb->len/skb->truesize, which is ~0.66 in our environment, the window
stays at ~0.25 * rcvbuf. This is because tp->window_clamp does not
change together with the tp->scaling_ratio update when autotuning is
disabled due to SO_RCVBUF. As a result, the window size is capped at the
initial window_clamp, which is also ~0.25 * rcvbuf, and never grows
bigger.

Most modern applications let the kernel do autotuning, and benefit from
the increased scaling_ratio. But there are applications such as kafka
that has a default setting of SO_RCVBUF=64k.

This patch increases the initial scaling_ratio from ~25% to 50% in order
to make it backward compatible with the original default
sysctl_tcp_adv_win_scale for applications setting SO_RCVBUF.

Fixes: dfa2f048 ("tcp: get rid of sysctl_tcp_adv_win_scale")
Signed-off-by: Hechao Li <hli@netflix.com>
Reviewed-by: Tycho Andersen <tycho@tycho.pizza>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/netdev/20240402215405.432863-1-hli@netflix.com/Signed-off-by: David S. Miller <davem@davemloft.net>

697a6c8c

Merge branch 'rtl8226b-serdes-switching' · c31bd5b6

David S. Miller authored Apr 12, 2024

Eric Woudstra says:

====================
rtl8226b/8221b add C45 instances and SerDes switching

Based on the comments in [PATCH net-next]
"Realtek RTL822x PHY rework to c45 and SerDes interface switching"

Adds SerDes switching interface between 2500base-x and sgmii for
rtl8221b and rtl8226b.

Add get_rate_matching() for rtl8226b and rtl8221b, reading the serdes
mode from phy.

Driver instances are added for rtl8226b and rtl8221b for Clause 45
access only. The existing code is not touched, they use newly added
functions. They also use the same rtl822xb_config_init() and
rtl822xb_get_rate_matching() as these functions also can be used for
direct Clause 45 access. Also Adds definition of MMC 31 registers,
which cannot be used through C45-over-C22, only when phydev->is_c45
is set.

Change rtlgen_get_speed() so the register value is passed as argument.
Using Clause 45 access, this value is retrieved differently.
Rename it to rtlgen_decode_speed() and add a call to it in
rtl822x_c45_read_status().

Add rtl822x_c45_get_features() to set supported port for rtl8221b.

Then 1 quirk is added for sfp modules known to have a rtl8221b
behind RollBall, Clause 45 only, protocol.

Changed in PATCH v4:
* Changed switch to if statement in rtl822xb_get_rate_matching()
* Removed setting ETHTOOL_LINK_MODE_MII_BIT in rtl822x_c45_get_features()

Changed in PATCH v3:
* Only apply to rtl8221b and rtl8226b phy's
* Set phydev->rate_matching in .config_init()
* Removed OEM SFP fixup for now, as there are modules with the same
  vendor name/PN, but with different PHY's. We found rtl8221b, but
  also the ty8821, which is not yet supported.

Changed in PATCH v2:
* Set author to Marek for the commit of the new C45 instances
* Separate commit for setting supported ports
* Renamed rtlgen_get_speed to rtlgen_decode_speed
* Always fill in possible interfaces
* Renamed sfp_fixup_oem_2_5g to sfp_fixup_oem_2_5gbaset
* Only update phydev->interface when link is up

Alexander Couzens (1):
  net: phy: realtek: configure SerDes mode for rtl822xb PHYs

Eric Woudstra (3):
  net: phy: realtek: add get_rate_matching() for rtl822xb PHYs
  net: phy: realtek: Change rtlgen_get_speed() to rtlgen_decode_speed()
  net: phy: realtek: add rtl822x_c45_get_features() to set supported
    port
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c31bd5b6

net: sfp: add quirk for another multigig RollBall transceiver · 1c77c721

Marek Behún authored Apr 09, 2024

Add quirk for another RollBall copper transceiver: Turris RTSFP-2.5G,
containing 2.5g capable RTL8221B PHY.
Signed-off-by: Marek Behún <kabel@kernel.org>
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c77c721

net: phy: realtek: add rtl822x_c45_get_features() to set supported port · 2d9ce648

Eric Woudstra authored Apr 09, 2024

Sets ETHTOOL_LINK_MODE_TP_BIT in phydev->supported.
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

2d9ce648

net: phy: realtek: Change rtlgen_get_speed() to rtlgen_decode_speed() · 2e4ea707

Eric Woudstra authored Apr 09, 2024

The value of the register to determine the speed, is retrieved
differently when using Clause 45 only. To use the rtlgen_get_speed()
function in this case, pass the value of the register as argument to
rtlgen_get_speed(). The function would then always return 0, so change it
to void. A better name for this function now is rtlgen_decode_speed().

Replace a call to genphy_read_status() followed by rtlgen_get_speed()
with a call to rtlgen_read_status() in rtl822x_read_status().

Add reading speed to rtl822x_c45_read_status().
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

2e4ea707

net: phy: realtek: Add driver instances for rtl8221b via Clause 45 · ad5ce743

Marek Behún authored Apr 09, 2024

Collected from several commits in [PATCH net-next]
"Realtek RTL822x PHY rework to c45 and SerDes interface switching"

The instances are used by Clause 45 only accessible PHY's on several sfp
modules, which are using RollBall protocol.
Signed-off-by: Marek Behún <kabel@kernel.org>
[ Added matching functions to differentiate C45 instances ]
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

ad5ce743

net: phy: realtek: add get_rate_matching() for rtl822xb PHYs · c189dbd7

Eric Woudstra authored Apr 09, 2024

Uses vendor register to determine if SerDes is setup in rate-matching mode.

Rate-matching only supported when SerDes is set to 2500base-x.
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c189dbd7

net: phy: realtek: configure SerDes mode for rtl822xb PHYs · deb8af52

Alexander Couzens authored Apr 09, 2024

The rtl8221b and rtl8226b series support switching SerDes mode between
2500base-x and sgmii based on the negotiated copper speed.

Configure this switching mode according to SerDes modes supported by
host.

There is an additional datasheet for RTL8226B/RTL8221B called
"SERDES MODE SETTING FLOW APPLICATION NOTE" where a sequence is
described to setup interface and rate adapter mode.

However, there is no documentation about the meaning of registers
and bits, it's literally just magic numbers and pseudo-code.
Signed-off-by: Alexander Couzens <lynxis@fe80.eu>
[ refactored, dropped HiSGMII mode and changed commit message ]
Signed-off-by: Marek Behún <kabel@kernel.org>
[ changed rtl822x_update_interface() to use vendor register ]
[ always fill in possible interfaces ]
[ only apply to rtl8221b and rtl8226b phy's ]
[ set phydev->rate_matching in .config_init() ]
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: should come before them, without any blank lines. As the
Signed-off-by: David S. Miller <davem@davemloft.net>

deb8af52

Merge branch 'net-dsa-allow-phylink_mac_ops-in-dsa-drivers' · af74be9f

Jakub Kicinski authored Apr 11, 2024

Russell King says:

====================
net: dsa: allow phylink_mac_ops in DSA drivers

This series showcases my idea of moving the phylink_mac_ops into DSA
drivers, using mv88e6xxx as an example. Since I'm only changing one
driver, providing the mac_ops has to be optional and the existing shims
need to be kept for unconverted drivers.

The first patch introduces a new helper that converts from the
phylink_config structure that phylink uses to communicate with MAC
drivers to the dsa_port structure. From this, DSA drivers can get
the dsa_switch structure and thus their implementation specific
data structure, and they can also retrieve the port index.

The second patch adds the support to the core DSA layer to allow
DSA drivers to provide phylink_mac_ops.

The third patch converts mv88e6xxx to use this.

I initially made this change after adding yet more phylink to DSA
driver shims for my work with phylink-based EEE support, and decided
that it was getting silly to keep implementing more and more shims.
There are cases where shims don't work well - we had already tripped
over a case a few years ago when the phylink mac_select_pcs operation
was introduced. Phylink tested for the presence of this in the ops
structure, but with DSA shims, this doesn't necessarily mean that
the sub-driver supports this method. The only way to find that out
is to call the method with dummy values and check the return code.

The same thing was partly true when adding EEE support, and I ended
up with this in phylink to determine whether the MAC supported EEE:

+static bool phylink_mac_supports_eee(struct phylink *pl)
+{
+       return pl->mac_ops->mac_disable_tx_lpi &&
+              pl->mac_ops->mac_enable_tx_lpi &&
+              pl->config->lpi_capabilities;
+}

because merely testing for the presence of the operations is
insufficient when shims are involved - and it wasn't possible to call
these functions in the way that mac_select_pcs could be called.

So, I think it's time to get away from this shimming model and instead
have drivers directly interface to the various subsystems.

This converts mv88e6xxx. I have similar patches for other DSA drivers
that will be sent once this has been reviewed.
====================

Link: https://lore.kernel.org/r/ZhbrbM+d5UfgafGp@shell.armlinux.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>

af74be9f

net: dsa: mv88e6xxx: provide own phylink MAC operations · 0cb6da0c

Russell King (Oracle) authored Apr 10, 2024

Convert mv88e6xxx to provide its own phylink MAC operations, thus
avoiding the shim layer in DSA's port.c
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/E1rudqK-006K9N-HY@rmk-PC.armlinux.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>

0cb6da0c

net: dsa: allow DSA switch drivers to provide their own phylink mac ops · cae425cb

Russell King (Oracle) authored Apr 10, 2024

Rather than having a shim for each and every phylink MAC operation,
allow DSA switch drivers to provide their own ops structure. When a
DSA driver provides the phylink MAC operations, the shimmed ops must
not be provided, so fail an attempt to register a switch with both
the phylink_mac_ops in struct dsa_switch and the phylink_mac_*
operations populated in dsa_switch_ops populated.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/E1rudqF-006K9H-Cc@rmk-PC.armlinux.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>

cae425cb

net: dsa: introduce dsa_phylink_to_port() · dd0c9855

Russell King (Oracle) authored Apr 10, 2024

We convert from a phylink_config struct to a dsa_port struct in many
places, let's provide a helper for this.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/E1rudqA-006K9B-85@rmk-PC.armlinux.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>

dd0c9855

tls: remove redundant assignment to variable decrypted · f7ac8fbd

Colin Ian King authored Apr 10, 2024

The variable decrypted is being assigned a value that is never read,
the control of flow after the assignment is via an return path and
decrypted is not referenced in this path. The assignment is redundant
and can be removed.

Cleans up clang scan warning:
net/tls/tls_sw.c:2150:4: warning: Value stored to 'decrypted' is never
read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20240410144136.289030-1-colin.i.king@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f7ac8fbd

ipv4: Remove RTO_ONLINK. · 5618603f

Guillaume Nault authored Apr 10, 2024

RTO_ONLINK was a flag used in ->flowi4_tos that allowed to alter the
scope of an IPv4 route lookup. Setting this flag was equivalent to
specifying RT_SCOPE_LINK in ->flowi4_scope.

With commit ec20b283 ("ipv4: Set scope explicitly in
ip_route_output()."), the last users of RTO_ONLINK have been removed.
Therefore, we can now drop the code that checked this bit and stop
modifying ->flowi4_scope in ip_route_output_key_hash().
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/57de760565cab55df7b129f523530ac6475865b2.1712754146.git.gnault@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

5618603f

tcp: add support for SO_PEEK_OFF socket option · 05ea4916

Jon Maloy authored Apr 09, 2024

When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.

In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
in a similar way it is done for Unix Domain sockets.

In the iperf3 log examples shown below, we can observe a throughput
improvement of 15-20 % in the direction host->namespace when using the
protocol splicer 'pasta' (https://passt.top).
This is a consistent result.

pasta(1) and passt(1) implement user-mode networking for network
namespaces (containers) and virtual machines by means of a translation
layer between Layer-2 network interface and native Layer-4 sockets
(TCP, UDP, ICMP/ICMPv6 echo).

Received, pending TCP data to the container/guest is kept in kernel
buffers until acknowledged, so the tool routinely needs to fetch new
data from socket, skipping data that was already sent.

At the moment this is implemented using a dummy buffer passed to
recvmsg(). With this change, we don't need a dummy buffer and the
related buffer copy (copy_to_user()) anymore.

passt and pasta are supported in KubeVirt and libvirt/qemu.

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF not supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
[  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
[  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
[  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
[  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
[  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
[  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
[  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
[  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
[  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
jmaloy@freyr:~/passt#
logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@freyr:~/passt$

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 52084
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
[  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
[  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
[  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
[  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
[  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
[  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
[  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
[  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
[  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
[  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@freyr:~/passt$

The perf record confirms this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when we have offset support.

Without offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                       -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

With offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                       -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
28.12%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240409152805.913891-1-jmaloy@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

05ea4916

net: usb: qmi_wwan: Remove generic .ndo_get_stats64 · 3cddfeca

Breno Leitao authored Apr 09, 2024

Commit 3e2f544d ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.

Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240409133307.2058099-2-leitao@debian.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3cddfeca

net: usb: qmi_wwan: Leverage core stats allocator · 8959bf2a

Breno Leitao authored Apr 09, 2024

With commit 34d21de9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the qmi_wwan driver and leverage the network
core allocation instead.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240409133307.2058099-1-leitao@debian.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8959bf2a

mpls: no longer hold RTNL in mpls_netconf_dump_devconf() · e0f89d28

Eric Dumazet authored Apr 10, 2024

- Use for_each_netdev_dump() to no longer rely
  on net->dev_index_head hash table.

- No longer care of net->dev_base_seq

- Fix return value at the end of a dump,
  so that NLMSG_DONE can be appended to current skb,
  saving one recvmsg() system call.

- No longer grab RTNL, RCU protection is enough,
  afer adding one READ_ONCE(mdev->input_enabled)
  in mpls_netconf_fill_devconf()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240410111951.2673193-1-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e0f89d28

flow_offload: fix flow_offload_has_one_action() kdoc · e1eb10f8

Asbjørn Sloth Tønnesen authored Apr 10, 2024

include/net/flow_offload.h:351: warning:
No description found for return value of 'flow_offload_has_one_action'
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Link: https://lore.kernel.org/r/20240410114718.15145-1-ast@fiberby.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e1eb10f8

net/mlx5e: Expose the VF/SF RX drop counter on the representor · 919b38a9

Carolina Jubran authored Apr 11, 2024

Q counters are device-level counters that track specific
events, among which are out_of_buffer events. These events
occur when packets are dropped due to a lack of receive
buffer in the RX queue.

Expose the total number of out_of_buffer events on the
VFs/SFs to their respective representor, using the
"ip stats group link" under the name of "rx_missed".

The "rx_missed" equals the sum of all
Q counters out_of_buffer values allocated on the VFs/SFs.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240410214154.250583-1-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

919b38a9

Merge branch 'minor-cleanups-to-skb-frag-ref-unref' · ef4ba011

Jakub Kicinski authored Apr 11, 2024

Mina Almasry says:

====================
Minor cleanups to skb frag ref/unref

This series is largely motivated by a recent discussion where there was
some confusion on how to properly ref/unref pp pages vs non pp pages:

https://lore.kernel.org/netdev/CAHS8izOoO-EovwMwAm9tLYetwikNPxC0FKyVGu1TPJWSz4bGoA@mail.gmail.com/T/#t

There is some subtely there because pp uses page->pp_ref_count for
refcounting, while non-pp uses get_page()/put_page() for ref counting.
Getting the refcounting pairs wrong can lead to kernel crash.

Additionally currently it may not be obvious to skb users unaware of
page pool internals how to properly acquire a ref on a pp frag. It
requires checking of skb->pp_recycle & is_pp_page() to make the correct
calls and may require some handling at the call site aware of arguable pp
internals.

This series is a minor refactor with a couple of goals:

1. skb users should be able to ref/unref a frag using
   [__]skb_frag_[un]ref() functions without needing to understand pp
   concepts and pp_ref_count vs get/put_page() differences.

2. reference counting functions should have a mirror opposite. I.e. there
   should be a foo_unref() to every foo_ref() with a mirror opposite
   implementation (as much as possible).

This is RFC to collect feedback if this change is desirable, but also so
that I don't race with the fix for the issue Dragos is seeing for his
crash.

https://lore.kernel.org/lkml/CAHS8izN436pn3SndrzsCyhmqvJHLyxgCeDpWXA4r1ANt3RCDLQ@mail.gmail.com/T/
====================

Link: https://lore.kernel.org/r/20240410190505.1225848-1-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ef4ba011

net: mirror skb frag ref/unref helpers · a580ea99

Mina Almasry authored Apr 10, 2024

Refactor some of the skb frag ref/unref helpers for improved clarity.

Implement napi_pp_get_page() to be the mirror counterpart of
napi_pp_put_page().

Implement skb_page_ref() to be the mirror of skb_page_unref().

Improve __skb_frag_ref() to become a mirror counterpart of
__skb_frag_unref(). Previously unref could handle pp & non-pp pages,
while the ref could only handle non-pp pages. Now both the ref & unref
helpers can correctly handle both pp & non-pp pages.

Now that __skb_frag_ref() can handle both pp & non-pp pages, remove
skb_pp_frag_ref(), and use __skb_frag_ref() instead.  This lets us
remove pp specific handling from skb_try_coalesce.

Additionally, since __skb_frag_ref() can now handle both pp & non-pp
pages, a latent issue in skb_shift() should now be fixed. Previously
this function would do a non-pp ref & pp unref on potential pp frags
(fragfrom). After this patch, skb_shift() should correctly do a pp
ref/unref on pp frags.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240410190505.1225848-3-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a580ea99

net: move skb ref helpers to new header · f6d827b1

Mina Almasry authored Apr 10, 2024

Add a new header, linux/skbuff_ref.h, which contains all the skb_*_ref()
helpers. Many of the consumers of skbuff.h do not actually use any of
the skb ref helpers, and we can speed up compilation a bit by minimizing
this header file.

Additionally in the later patch in the series we add page_pool support
to skb_frag_ref(), which requires some page_pool dependencies. We can
now add these dependencies to skbuff_ref.h instead of a very ubiquitous
skbuff.h
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20240410190505.1225848-2-almasrymina@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f6d827b1