Commits · 207184853dbdb62d8b02c7a141d3297e94e33451 · Kirill Smelkov / linux

16 Dec, 2023 3 commits

tcp/dccp: change source port selection at connect() time · 20718485

Eric Dumazet authored Dec 14, 2023

In commit 1580ab63 ("tcp/dccp: better use of ephemeral ports in connect()")
we added an heuristic to select even ports for connect() and odd ports for bind().

This was nice because no applications changes were needed.

But it added more costs when all even ports are in use,
when there are few listeners and many active connections.

Since then, IP_LOCAL_PORT_RANGE has been added to permit an application
to partition ephemeral port range at will.

This patch extends the idea so that if IP_LOCAL_PORT_RANGE is set on
a socket before accept(), port selection no longer favors even ports.

This means that connect() can find a suitable source port faster,
and applications can use a different split between connect() and bind()
users.

This should give more entropy to Toeplitz hash used in RSS: Using even
ports was wasting one bit from the 16bit sport.

A similar change can be done in inet_csk_find_open_port() if needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20231214192939.1962891-3-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

20718485

inet: returns a bool from inet_sk_get_local_port_range() · 41db7626

Eric Dumazet authored Dec 14, 2023

Change inet_sk_get_local_port_range() to return a boolean,
telling the callers if the port range was provided by
IP_LOCAL_PORT_RANGE socket option.

Adds documentation while we are at it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20231214192939.1962891-2-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

41db7626

dt-bindings: net: marvell,orion-mdio: Drop "reg" sizes schema · 758a8d5b

Rob Herring authored Dec 13, 2023

Defining the size of register regions is not really in scope of what
bindings need to cover. The schema for this is also not completely correct
as a reg entry can be variable number of cells for the address and size,
but the schema assumes 1 cell.
Signed-off-by: Rob Herring <robh@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20231213232455.2248056-1-robh@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

758a8d5b

15 Dec, 2023 37 commits

hv_netvsc: remove duplicated including of slab.h · e91db161

Wang Jinchao authored Dec 15, 2023

rm the second include <linux/slab.h>
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e91db161

Merge branch 'netlink-specs-legacy' · f06f0891

David S. Miller authored Dec 15, 2023

Jakub Kicinski says:

====================
netlink: specs: prep legacy specs for C code gen

Minor adjustments to some specs to make them ready for C code gen.

v2:
 - fix MAINATINERS and subject of patch 3
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

f06f0891

netlink: specs: mptcp: rename the MPTCP path management spec · b059aef7

Jakub Kicinski authored Dec 14, 2023

We assume in handful of places that the name of the spec is
the same as the name of the family. We could fix that but
it seems like a fair assumption to make. Rename the MPTCP
spec instead.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

b059aef7

netlink: specs: ovs: correct enum names in specs · 209bcb9a

Jakub Kicinski authored Dec 14, 2023

Align the enum-names of OVS with what's actually in the uAPI.
Either correct the names, or mark the enum as empty because
the values are in fact #defines.
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

209bcb9a

netlink: specs: ovs: remove fixed header fields from attrs · 3ada0b33

Jakub Kicinski authored Dec 14, 2023

Op's "attributes" list is a workaround for families with a single
attr set. We don't want to render a single huge request structure,
the same for each op since we know that most ops accept only a small
set of attributes. "Attributes" list lets us narrow down the attributes
to what op acctually pays attention to.

It doesn't make sense to put names of fixed headers in there.
They are not "attributes" and we can't really narrow down the struct
members.

Remove the fixed header fields from attrs for ovs families
in preparation for C codegen support.
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

3ada0b33

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · 283f105b

David S. Miller authored Dec 15, 2023

Tony Nguyen says:

====================
add v2 FW logging for ice driver

Paul Stillwell says:

Firmware (FW) log support was added to the ice driver, but that version is
no longer supported. There is a newer version of FW logging (v2) that
adds more control knobs to get the exact data out of the FW
for debugging.

The interface for FW logging is debugfs. This was chosen based on
discussions here:
https://lore.kernel.org/netdev/20230214180712.53fc8ba2@kernel.org/ and
https://lore.kernel.org/netdev/20231012164033.1069fb4b@kernel.org/

We talked about using devlink in a variety of ways, but none of those
options made any sense for the way the FW reports data. We briefly talked
about using ethtool, but that seemed to go by the wayside. Ultimately it
seems like using debugfs is the way to go so re-implement the code to use
that.

FW logging is across all the PFs on the device so restrict the commands to
only PF0.

If the device supports FW logging then a directory named 'fwlog' will be
created under '/sys/kernel/debug/ice/<pci_dev>'. A variety of files will be
created to manage the behavior of logging. The following files will be
created:
- modules/<module>
- nr_messages
- enable
- log_size
- data

where
modules/<module> is used to read/write the log level for a specific module

nr_messages is used to determine how many events should be in each message
sent to the driver

enable is used to start/stop FW logging. This is a boolean value so only 1
or 0 are permissible values

log_size is used to configure the amount of memory the driver uses for log
data

data is used to read/clear the log data

Generally there is a lot of data and dumping that data to syslog will
result in a loss of data. This causes problems when decoding the data and
the user doesn't know that data is missing until later. Instead of dumping
the FW log output to syslog use debugfs. This ensures that all the data the
driver has gets retrieved correctly.

The FW log data is binary data that the FW team decodes to determine what
happened in firmware. The binary blob is sent to Intel for decoding.
---
v6:
- use seq_printf() for outputting module info when reading from 'module' file
- replace code that created argc and argv for handling command line input
- removed checks in all the _read() and _write() functions to see if FW logging
  is supported because the files will not exist if it is not supported
- removed warnings on allocation failures on debugfs file creation failures
- removed a newline between memory allocation and checking if the memory was
  allocated
- fixed cases where we could just return the value from a function call
  instead of saving the value in a variable
- moved the check for PFO in ice_fwlog_init() to an earlier patch
- reworked all of argument scanning in the _write() functions in ice_debugfs.c
  to remove adding characters past the end of the buffer

v5: https://lore.kernel.org/netdev/20231205211251.2122874-1-anthony.l.nguyen@intel.com/
- changed the log level configuration from a single file for all modules to a
  file per module.
- changed 'nr_buffs' to 'log_size' because users understand memory sizes
  better than a number of buffers
- changed 'resolution' to 'nr_messages' to better reflect what it represents
- updated documentation to reflect these changes
- updated documentation to indicate that FW logging must be disabled to
  clear the data. also clarified that any value written to the 'data' file will
  clear the data

v4: https://lore.kernel.org/netdev/20231005170110.3221306-1-anthony.l.nguyen@intel.com/
- removed CONFIG_DEBUG_FS wrapper around code because the debugfs calls handle
  this case already
- moved ice_debugfs_exit() call to remove unreachable code issue
- minor changes to documentation based on feedback

v3: https://lore.kernel.org/netdev/20230815165750.2789609-1-anthony.l.nguyen@intel.com/
- Adjust error path cleanup in ice_module_init() for unreachable code.

v2: https://lore.kernel.org/netdev/20230810170109.1963832-1-anthony.l.nguyen@intel.com/
- Rewrote code to use debugfs instead of devlink

v1: https://lore.kernel.org/netdev/20230209190702.3638688-1-anthony.l.nguyen@intel.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

283f105b

Merge branch 'mv88e6xxx-counters' · b84d66b0

David S. Miller authored Dec 15, 2023

Tobias Waldekranz says:

====================
net: dsa: mv88e6xxx: Add "eth-mac" and "rmon" counter group support

The majority of the changes (2/8) are about refactoring the existing
ethtool statistics support to make it possible to read individual
counters, rather than the whole set.

4/8 tries to collect all information about a stat in a single place
using a mapper macro, which is then used to generate the original list
of stats, along with a matching enum. checkpatch is less than amused
with this construct, but prior art exists (__BPF_FUNC_MAPPER in
include/uapi/linux/bpf.h, for example).

To support the histogram counters from the "rmon" group, we have to
change mv88e6xxx's configuration of them. Instead of counting rx and
tx, we restrict them to rx-only. 6/8 has the details.

With that in place, adding the actual counter groups is pretty
straight forward (5,7/8).

Tie it all together with a selftest (8/8).

v3 -> v4:
- Return size_t from mv88e6xxx_stats_get_stats
- Spelling errors in commit message of 6/8
- Improve selftest:
  - Report progress per-bucket
  - Test both ports in the pair
  - Increase MTU, if required

v2 -> v3:
- Added 6/8
- Added 8/8

v1 -> v2:
- Added 1/6
- Added 3/6
- Changed prototype of stats operation to reflect the fact that the
  number of read stats are returned, no errors
- Moved comma into MV88E6XXX_HW_STAT_MAPPER definition
- Avoid the construction of mapping table iteration which relied on
  struct layouts outside of mv88e6xxx's control
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

b84d66b0

selftests: forwarding: ethtool_rmon: Add histogram counter test · 00e7f29d

Tobias Waldekranz authored Dec 14, 2023

Validate the operation of rx and tx histogram counters, if supported
by the interface, by sending batches of packets targeted for each
bucket.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

00e7f29d

net: dsa: mv88e6xxx: Add "rmon" counter group support · 394518e3

Tobias Waldekranz authored Dec 14, 2023

Report the applicable subset of an mv88e6xxx port's counters using
ethtool's standardized "rmon" counter group.
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

394518e3

net: dsa: mv88e6xxx: Limit histogram counters to ingress traffic · ceea48ef

Tobias Waldekranz authored Dec 14, 2023

Chips in this family only have one set of histogram counters, which
can be used to count ingressing and/or egressing traffic. mv88e6xxx
has, up until this point, kept the hardware default of counting both
directions.

In the mean time, standard counter group support has been added to
ethtool. Via that interface, drivers may report ingress-only and
egress-only histograms separately - but not combined.

In order for mv88e6xxx to maximize amount of diagnostic information
that can be exported via standard interfaces, we opt to limit the
histogram counters to ingress traffic only. Which will allow us to
export them via the standard "rmon" group in an upcoming commit.

The reason for choosing ingress-only over egress-only, is to be
compatible with RFC2819 (RMON MIB).
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ceea48ef

net: dsa: mv88e6xxx: Add "eth-mac" counter group support · 0e047cec

Tobias Waldekranz authored Dec 14, 2023

Report the applicable subset of an mv88e6xxx port's counters using
ethtool's standardized "eth-mac" counter group.
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0e047cec

net: dsa: mv88e6xxx: Give each hw stat an ID · 5780acbd

Tobias Waldekranz authored Dec 14, 2023

With the upcoming standard counter group support, we are no longer
reading out the whole set of counters, but rather mapping a subset to
the requested group.

Therefore, create an enum with an ID for each stat, such that
mv88e6xxx_hw_stats[] can be subscripted with a human-readable ID
corresponding to the counter's name.
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5780acbd

net: dsa: mv88e6xxx: Fix mv88e6352_serdes_get_stats error path · fc82a08a

Tobias Waldekranz authored Dec 14, 2023

mv88e6xxx_get_stats, which collects stats from various sources,
expects all callees to return the number of stats read. If an error
occurs, 0 should be returned.

Prevent future mishaps of this kind by updating the return type to
reflect this contract.
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc82a08a

net: dsa: mv88e6xxx: Create API to read a single stat counter · 3def80e5

Tobias Waldekranz authored Dec 14, 2023

This change contains no functional change. We simply push the hardware
specific stats logic to a function reading a single counter, rather
than the whole set.

This is a preparatory change for the upcoming standard ethtool
statistics support (i.e. "eth-mac", "eth-ctrl" etc.).
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3def80e5

net: dsa: mv88e6xxx: Push locking into stats snapshotting · d624afaf

Tobias Waldekranz authored Dec 14, 2023

This is more consistent with the driver's general structure.
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d624afaf

Merge branch 'net-optmem_max-changes' · 9ed816b1

David S. Miller authored Dec 15, 2023

Eric Dumazet says:

====================
net: optmem_max changes

optmem_max default value is too small for tx zerocopy workloads.

First patch increases default from 20KB to 128 KB,
which is the value we have used for seven years.

Second patch makes optmem_max sysctl per netns.

Last patch tweaks two tests accordingly.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9ed816b1

selftests/net: optmem_max became per netns · 18872ba8

Eric Dumazet authored Dec 14, 2023

/proc/sys/net/core/optmem_max is now per netns, change two tests
that were saving/changing/restoring its value on the parent netns.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

18872ba8

net: Namespace-ify sysctl_optmem_max · f5769fae

Eric Dumazet authored Dec 14, 2023

optmem_max being used in tx zerocopy,
we want to be able to control it on a netns basis.

Following patch changes two tests.

Tested:

oqq130:~# cat /proc/sys/net/core/optmem_max
131072
oqq130:~# echo 1000000 >/proc/sys/net/core/optmem_max
oqq130:~# cat /proc/sys/net/core/optmem_max
1000000
oqq130:~# unshare -n
oqq130:~# cat /proc/sys/net/core/optmem_max
131072
oqq130:~# exit
logout
oqq130:~# cat /proc/sys/net/core/optmem_max
1000000
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f5769fae

net: increase optmem_max default value · 49445667

Eric Dumazet authored Dec 14, 2023

For many years, /proc/sys/net/core/optmem_max default value
on a 64bit kernel has been 20 KB.

Regular usage of TCP tx zerocopy needs a bit more.

Google has used 128KB as the default value for 7 years without
any problem.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

49445667

Merge branch 'mlxsw-CFF-flood-mode' · e16064c9

David S. Miller authored Dec 15, 2023

Petr Machata says:

====================
mlxsw: CFF flood mode: NVE underlay configuration

Recently, support for CFF flood mode (for Compressed FID Flooding) was
added to the mlxsw driver. The most recent patchset has a detailed coverage
of what CFF is and what has changed and how:

    https://lore.kernel.org/netdev/cover.1701183891.git.petrm@nvidia.com/

In CFF flood mode, each FID allocates a handful (in our implementation two
or three) consecutive PGT entries. One entry holds the flood vector for
unknown-UC traffic, one for MC, one for BC.

To determine how to look up flood vectors, the CFF flood mode uses a
concept of flood profiles, which are IDs that reference mappings from
traffic types to offsets. In the case of CFF flood mode, the offset in
question is applied to the PGT address configured at a FID. The same
mechanism is used by NVE underlay for flooding. Again the profile ID and
the traffic type determine the offset to apply, this time to KVD address
used to look up flooding entries. Since mlxsw configures NVE underlay flood
the same regardless of traffic type, only one offset was ever needed: the
zero, which is the default, and thus no explicit configuration was needed.

Now that CFF uses profiles as well, it would be better to configure the
profile used by NVE explicitly, to make the configuration visible in the
source code.

In this patchset, add the register support (in patch #1), add a new traffic
type to refer to "any traffic at all" (in patch #2) and finally configure
the NVE profile explicitly for FIDs (in patch #3).

So far, the implicitly configured flood profile was the ID 0. With this
patchset, it changes to 3, leaving the 0 free to allow us to spot missed
configuration.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e16064c9

mlxsw: spectrum_fid: Set NVE flood profile as part of FID configuration · 6dab4083

Petr Machata authored Dec 14, 2023

The NVE flood profile is used for determining of offset applied to KVD
address for NVE flood. We currently do not set it, leaving it at the
default value of 0. That is not an issue: all the traffic-type-to-offset
mappings (as configured by SFFP) default to offset of 0. This is what we
need anyway, as mlxsw only allocates a single KVD entry for NVE underlay.

The field is only relevant on Spectrum-2 and above. So to be fully
consistent, we should split the existing controlled ops to Spectrum-1 and
Spectrum>1 variants, with only the latter setting the field. But that seems
like a lot of overhead for a single field whose meaning is "everything is
the default". So instead pretend that the NVE flood profile does not exist
in the controlled flood mode, like we have so far, and only set it when
flood mode is CFF.

Setting this at all serves dual purpose. First, it is now clear which
profile belongs to NVE, because in the CFF mode, we have multiple users.
This should prevent bugs in flood profile management. Second, using
specifically non-zero value means there will be no valid uses of the
profile 0, which we can therefore use as a sentinel.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6dab4083

mlxsw: spectrum_fid: Add an "any" packet type · b2f5eb5a

Petr Machata authored Dec 14, 2023

Flood profiles have been used prior to CFF support for NVE underlay. Like
is the case with FID flooding, an NVE profile describes at which offset a
datum is located given traffic type. mlxsw currently only ever uses one KVD
entry for NVE lookup, i.e. regardless of traffic type, the offset is always
zero. To be able to describe this, add a traffic type enumerator describing
"any traffic type".
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b2f5eb5a

mlxsw: reg: Add nve_flood_prf_id field to SFMR · d9d441e8

Petr Machata authored Dec 14, 2023

The field is used for setting a flood profile for lookup of KVD entry for
NVE underlay. As the other uses of flood profile, this references a traffic
type-to-offset mapping, except here it is not applied to PGT offsets, but
KVD offsets.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d9d441e8

Merge branch 'net-at803x-cleanups' · 523e1f5f

David S. Miller authored Dec 15, 2023

Christian Marangi says:

====================
net: phy: at803x: additional cleanup for qca808x

This small series is a preparation for the big code split. While the
qca808x code is waiting to be reviwed and merged, we can further cleanup
and generalize shared functions between at803x and qca808x.

With these last 2 patch everything is ready to move the driver to a
dedicated directory and split the code by creating a library module
for the few shared functions between the 2 driver.

Eventually at803x can be further cleaned and generalized but everything
will be already self contained and related only to at803x family of PHYs.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

523e1f5f

net: phy: at803x: make read specific status function more generic · 38eb804e

Christian Marangi authored Dec 14, 2023

Rework read specific status function to be more generic. The function
apply different speed mask based on the PHY ID. Make it more generic by
adding an additional arg to pass the specific speed (ss) mask and use
the provided mask to parse the speed value.

This is needed to permit an easier deatch of qca808x code from the
at803x driver.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

38eb804e

net: phy: at803x: move specific qca808x config_aneg to dedicated function · 8e732f1c

Christian Marangi authored Dec 14, 2023

Move specific qca808x config_aneg to dedicated function to permit easier
split of qca808x portion from at803x driver.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8e732f1c

Merge branch 'vsock-credit-update' · 6da0bcb8

David S. Miller authored Dec 15, 2023

Arseniy Krasnov says:

====================
send credit update during setting SO_RCVLOWAT

                               DESCRIPTION

This patchset fixes old problem with hungup of both rx/tx sides and adds
test for it. This happens due to non-default SO_RCVLOWAT value and
deferred credit update in virtio/vsock. Link to previous old patchset:
https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e5574524c@sberdevices.ru/

Here is what happens step by step:

                                  TEST

                            INITIAL CONDITIONS

1) Vsock buffer size is 128KB.
2) Maximum packet size is also 64KB as defined in header (yes it is
   hardcoded, just to remind about that value).
3) SO_RCVLOWAT is default, e.g. 1 byte.

                                 STEPS

            SENDER                              RECEIVER
1) sends 128KB + 1 byte in a
   single buffer. 128KB will
   be sent, but for 1 byte
   sender will wait for free
   space at peer. Sender goes
   to sleep.

2)                                     reads 64KB, credit update not sent
3)                                     sets SO_RCVLOWAT to 64KB + 1
4)                                     poll() -> wait forever, there is
                                       only 64KB available to read.

So in step 4) receiver also goes to sleep, waiting for enough data or
connection shutdown message from the sender. Idea to fix it is that rx
kicks tx side to continue transmission (and may be close connection)
when rx changes number of bytes to be woken up (e.g. SO_RCVLOWAT) and
this value is bigger than number of available bytes to read.

I've added small test for this, but not sure as it uses hardcoded value
for maximum packet length, this value is defined in kernel header and
used to control deferred credit update. And as this is not available to
userspace, I can't control test parameters correctly (if one day this
define will be changed - test may become useless).

Head for this patchset is:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=9bab51bd662be4c3ebb18a28879981d69f3ef15a

Link to v1:
https://lore.kernel.org/netdev/20231108072004.1045669-1-avkrasnov@salutedevices.com/
Link to v2:
https://lore.kernel.org/netdev/20231119204922.2251912-1-avkrasnov@salutedevices.com/
Link to v3:
https://lore.kernel.org/netdev/20231122180510.2297075-1-avkrasnov@salutedevices.com/
Link to v4:
https://lore.kernel.org/netdev/20231129212519.2938875-1-avkrasnov@salutedevices.com/
Link to v5:
https://lore.kernel.org/netdev/20231130130840.253733-1-avkrasnov@salutedevices.com/
Link to v6:
https://lore.kernel.org/netdev/20231205064806.2851305-1-avkrasnov@salutedevices.com/
Link to v7:
https://lore.kernel.org/netdev/20231206211849.2707151-1-avkrasnov@salutedevices.com/
Link to v8:
https://lore.kernel.org/netdev/20231211211658.2904268-1-avkrasnov@salutedevices.com/
Link to v9:
https://lore.kernel.org/netdev/20231214091947.395892-1-avkrasnov@salutedevices.com/

Changelog:
v1 -> v2:
 * Patchset rebased and tested on new HEAD of net-next (see hash above).
 * New patch is added as 0001 - it removes return from SO_RCVLOWAT set
   callback in 'af_vsock.c' when transport callback is set - with that
   we can set 'sk_rcvlowat' only once in 'af_vsock.c' and in future do
   not copy-paste it to every transport. It was discussed in v1.
 * See per-patch changelog after ---.
v2 -> v3:
 * See changelog after --- in 0003 only (0001 and 0002 still same).
v3 -> v4:
 * Patchset rebased and tested on new HEAD of net-next (see hash above).
 * See per-patch changelog after ---.
v4 -> v5:
 * Change patchset tag 'RFC' -> 'net-next'.
 * See per-patch changelog after ---.
v5 -> v6:
 * New patch 0003 which sends credit update during reading bytes from
   socket.
 * See per-patch changelog after ---.
v6 -> v7:
 * Patchset rebased and tested on new HEAD of net-next (see hash above).
 * See per-patch changelog after ---.
v7 -> v8:
 * See per-patch changelog after ---.
v8 -> v9:
 * Patchset rebased and tested on new HEAD of net-next (see hash above).
 * Add 'Fixes' tag for the current 0002.
 * Reorder patches by moving two fixes first.
v9 -> v10:
 * Squash 0002 and 0003 and update commit message in result.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6da0bcb8

vsock/test: two tests to check credit update logic · 542e893f

Arseniy Krasnov authored Dec 14, 2023

Both tests are almost same, only differs in two 'if' conditions, so
implemented in a single function. Tests check, that credit update
message is sent:

1) During setting SO_RCVLOWAT value of the socket.
2) When number of 'rx_bytes' become smaller than SO_RCVLOWAT value.
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

542e893f

virtio/vsock: send credit update during setting SO_RCVLOWAT · 0fe17989

Arseniy Krasnov authored Dec 14, 2023

Send credit update message when SO_RCVLOWAT is updated and it is bigger
than number of bytes in rx queue. It is needed, because 'poll()' will
wait until number of bytes in rx queue will be not smaller than
O_RCVLOWAT, so kick sender to send more data. Otherwise mutual hungup
for tx/rx is possible: sender waits for free space and receiver is
waiting data in 'poll()'.

Rename 'set_rcvlowat' callback to 'notify_set_rcvlowat' and set
'sk->sk_rcvlowat' only in one place (i.e. 'vsock_set_rcvlowat'), so the
transport doesn't need to do it.

Fixes: b89d882d ("vsock/virtio: reduce credit update messages")
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0fe17989

virtio/vsock: fix logic which reduces credit update messages · 93b80887

Arseniy Krasnov authored Dec 14, 2023

Add one more condition for sending credit update during dequeue from
stream socket: when number of bytes in the rx queue is smaller than
SO_RCVLOWAT value of the socket. This is actual for non-default value
of SO_RCVLOWAT (e.g. not 1) - idea is to "kick" peer to continue data
transmission, because we need at least SO_RCVLOWAT bytes in our rx
queue to wake up user for reading data (in corner case it is also
possible to stuck both tx and rx sides, this is why 'Fixes' is used).

Fixes: b89d882d ("vsock/virtio: reduce credit update messages")
Signed-off-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

93b80887

ipmr: support IP_PKTINFO on cache report IGMP msg · bb740365

Leone Fernando authored Dec 13, 2023

In order to support IP_PKTINFO on those packets, we need to call
ipv4_pktinfo_prepare.

When sending mrouted/pimd daemons a cache report IGMP msg, it is
unnecessary to set dst on the newly created skb.
It used to be necessary on older versions until
commit d826eb14 ("ipv4: PKTINFO doesnt need dst reference") which
changed the way IP_PKTINFO struct is been retrieved.

Changes from v1:
1. Undo changes in ipv4_pktinfo_prepare function. use it directly
   and copy the control block.

Fixes: d826eb14 ("ipv4: PKTINFO doesnt need dst reference")
Signed-off-by: Leone Fernando <leone4fernando@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bb740365

net: mana: add msix index sharing between EQs · 02fed6d9

Konstantin Taranov authored Dec 13, 2023

This patch allows to assign and poll more than one EQ on the same
msix index.
It is achieved by introducing a list of attached EQs in each IRQ context.
It also removes the existing msix_index map that tried to ensure that there
is only one EQ at each msix_index.
This patch exports symbols for creating EQs from other MANA kernel modules.
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

02fed6d9

octeontx2-af: Fix multicast/mirror group lock/unlock issue · 10b7572d

Suman Ghosh authored Dec 13, 2023

As per the existing implementation, there exists a race between finding
a multicast/mirror group entry and deleting that entry. The group lock
was taken and released independently by rvu_nix_mcast_find_grp_elem()
function. Which is incorrect and group lock should be taken during the
entire operation of group updation/deletion. This patch fixes the same.

Fixes: 51b2804c ("octeontx2-af: Add new mbox to support multicast/mirror offload")
Signed-off-by: Suman Ghosh <sumang@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

10b7572d

Merge tag 'mlx5-updates-2023-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 12da68e2

David S. Miller authored Dec 15, 2023

Saeed Mahameed says:

====================
mlx5-updates-2023-12-13

Preparation for mlx5e socket direct feature.

Socket direct will allow multiple PF devices attached to different
NUMA nodes but sharing the same physical port.

The following series is a small refactoring series in preparation
to support socket direct in the following submission.

Highlights:
 - Define required device registers and bits related to socket direct
 - Flow steering re-arrangements
 - Generalize TX objects (TISs) and store them in a common object, will
   be useful in the next series for per function object management.
 - Decouple raw CQ objects from their parent netdev priv
 - Prepare devcom for Socket Direct device group discovery.

Please see the individual patches for more information.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

12da68e2

Merge branch 'net-phy-rust' · d6beb085

David S. Miller authored Dec 15, 2023

FUJITA Tomonori says:

====================
Rust abstractions for network PHY drivers

No functional change since v10; only comment and commit log updates.

This patchset adds Rust abstractions for phylib. It doesn't fully
cover the C APIs yet but I think that it's already useful. I implement
two PHY drivers (Asix AX88772A PHYs and Realtek Generic FE-GE). Seems
they work well with real hardware.

The first patch introduces Rust bindings for phylib.

The second patch adds a macro to declare a kernel module for PHYs
drivers.

The third adds the Rust ETHERNET PHY LIBRARY entry to MAINTAINERS
file; adds the binding file and me as a maintainer (as Andrew Lunn
suggested) with Trevor Gross as a reviewer.

The last patch introduces the Rust version of Asix PHY driver,
drivers/net/phy/ax88796b.c. The features are equivalent to the C
version. You can choose C (by default) or Rust version on kernel
configuration.

v11:
  - adds Andrew, Alice, and Trevor's Reviewed-by
  - comment update
v10: https://lore.kernel.org/netdev/20231210234924.1453917-1-fujita.tomonori@gmail.com/T/
  - adds Trevor's SoB to the third patch
  - adds Benno's Reviewed-by to the second patch
v9: https://lore.kernel.org/netdev/20231205.124531.842372711631366729.fujita.tomonori@gmail.com/T/
  - adds a workaround to access to a bit field in phy_device
  - fixes a comment typo
v8: https://lore.kernel.org/netdev/20231123050412.1012252-1-fujita.tomonori@gmail.com/
  - updates the safety comments on Device and its related code
  - uses _phy_start_aneg instead of phy_start_aneg
  - drops the patch for enum synchronization
  - moves Sync From Registration to DriverVTable
  - fixes doctest errors
  - minor cleanups
v7: https://lore.kernel.org/netdev/20231026001050.1720612-1-fujita.tomonori@gmail.com/T/
  - renames get_link() to is_link_up()
  - improves the macro format
  - improves the commit log in the third patch
  - improves comments
v6: https://lore.kernel.org/netdev/20231025.090243.1437967503809186729.fujita.tomonori@gmail.com/T/
  - improves comments
  - makes the requirement of phy_drivers_register clear
  - fixes Makefile of the third patch
v5: https://lore.kernel.org/all/20231019.094147.1808345526469629486.fujita.tomonori@gmail.com/T/
  - drops the rustified-enum option, writes match by hand; no *risk* of UB
  - adds Miguel's patch for enum checking
  - moves CONFIG_RUST_PHYLIB_ABSTRACTIONS to drivers/net/phy/Kconfig
  - adds a new entry for this abstractions in MAINTAINERS
  - changes some of Device's methods to take &mut self
  - comment improvment
v4: https://lore.kernel.org/netdev/20231012125349.2702474-1-fujita.tomonori@gmail.com/T/
  - split the core patch
  - making Device::from_raw() private
  - comment improvement with code update
  - commit message improvement
  - avoiding using bindings::phy_driver in public functions
  - using an anonymous constant in module_phy_driver macro
v3: https://lore.kernel.org/netdev/20231011.231607.1747074555988728415.fujita.tomonori@gmail.com/T/
  - changes the base tree to net-next from rust-next
  - makes this feature optional; only enabled with CONFIG_RUST_PHYLIB_BINDINGS=y
  - cosmetic code and comment improvement
  - adds copyright
v2: https://lore.kernel.org/netdev/20231006094911.3305152-2-fujita.tomonori@gmail.com/T/
  - build failure fix
  - function renaming
v1: https://lore.kernel.org/netdev/20231002085302.2274260-3-fujita.tomonori@gmail.com/T/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d6beb085

net: phy: add Rust Asix PHY driver · cbe0e415

FUJITA Tomonori authored Dec 13, 2023

This is the Rust implementation of drivers/net/phy/ax88796b.c. The
features are equivalent. You can choose C or Rust version kernel
configuration.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Reviewed-by: Trevor Gross <tmgross@umich.edu>
Reviewed-by: Benno Lossin <benno.lossin@proton.me>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cbe0e415

MAINTAINERS: add Rust PHY abstractions for ETHERNET PHY LIBRARY · cbaa28f9

FUJITA Tomonori authored Dec 13, 2023

Adds me as a maintainer and Trevor as a reviewer.

The files are placed at rust/kernel/ directory for now but the files
are likely to be moved to net/ directory once a new Rust build system
is implemented.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Signed-off-by: Trevor Gross <tmgross@umich.edu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

cbaa28f9