Commits · 82fe335b78f76772acb6053b57bce95a0ad789bb · Kirill Smelkov / linux

27 Jan, 2023 16 commits

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · 82fe335b

Jakub Kicinski authored Jan 26, 2023

Tony Nguyen says:

====================
virtchnl: update and refactor

Jesse Brandeburg says:

The virtchnl.h file is used by i40e/ice physical function (PF) drivers
and irdma when talking to the iavf driver. This series cleans up the
header file by removing unused elements, adding/cleaning some comments,
fixing the data structures so they are explicitly defined, including
padding, and finally does a long overdue rename of the IWARP members in
the structures to RDMA, since the ice driver and it's associated Intel
Ethernet E800 series adapters support both RDMA and IWARP.

The whole series should result in no functional change, but hopefully
clearer code.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  virtchnl: i40e/iavf: rename iwarp to rdma
  virtchnl: do structure hardening
  virtchnl: update header and increase header clarity
  virtchnl: remove unused structure declaration
====================

Link: https://lore.kernel.org/r/20230125212441.4030014-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

82fe335b

Merge branch 'tools-ynl-prevent-reorder-and-fix-flags' · 0313afe8

Jakub Kicinski authored Jan 26, 2023

Jakub Kicinski says:

====================
tools: ynl: prevent reorder and fix flags

Some codegen improvements for YAML specs.

First, Lorenzon discovered when switching the XDP feature family
to use flags instead of pure enum that the kdoc got garbled.
The support for enum and flags is therefore unified.

Second when regenerating all families we discussed so far I noticed
that some netlink policies jumped around. We need to ensure we don't
render code based on their ordering in a hash.
====================

Link: https://lore.kernel.org/r/20230126000235.1085551-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

0313afe8

tools: ynl: store ops in ordered dict to avoid random ordering · 3a43ded0

Jakub Kicinski authored Jan 25, 2023

When rendering code we should walk the ops in the order in which
they are declared in the spec. This is both more intuitive and
prevents code from jumping around when hashing in the dict changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3a43ded0

tools: ynl: rename ops_list -> msg_list · b49c34e2

Jakub Kicinski authored Jan 25, 2023

ops_list contains all the operations, but the main iteration use
case is to walk only ops which define attrs. Rename ops_list to
msg_list, because now it looks like the contents are the same,
just the format is different. While at it convert from tuple
to just keys, none of the users care about the name of the op.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

b49c34e2

tools: ynl: support kdocs for flags in code generation · 66fa34b9

Jakub Kicinski authored Jan 25, 2023

Lorenzo reports that after switching from enum to flags netdev
family lost ability to render kdoc (and the enum contents got
generally garbled).

Combine the flags and enum handling in uAPI handling.
Reported-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

66fa34b9

Merge branch 'convert-drivers-to-return-xfrm-configuration-errors-through-extack' · 868c82f3

Jakub Kicinski authored Jan 26, 2023

Leon Romanovsky says:

====================
Convert drivers to return XFRM configuration errors through extack

This series continues effort started by Sabrina to return XFRM configuration
errors through extack. It allows for user space software stack easily present
driver failure reasons to users.

As a note, Intel drivers have a path where extack is equal to NULL, and error
prints won't be available in current patchset. If it is needed, it can be
changed by adding special to Intel macro to print to dmesg in case of
extack == NULL.
====================

Link: https://lore.kernel.org/r/cover.1674560845.git.leon@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

868c82f3

cxgb4: fill IPsec state validation failure reason · 8c284ea4

Leon Romanovsky authored Jan 24, 2023

Rely on extack to return failure reason.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8c284ea4

bonding: fill IPsec state validation failure reason · 3fe57986

Leon Romanovsky authored Jan 24, 2023

Rely on extack to return failure reason.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3fe57986

ixgbe: fill IPsec state validation failure reason · 505c500c

Leon Romanovsky authored Jan 24, 2023

Rely on extack to return failure reason.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

505c500c

ixgbevf: fill IPsec state validation failure reason · c068ec5c

Leon Romanovsky authored Jan 24, 2023

Rely on extack to return failure reason.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c068ec5c

nfp: fill IPsec state validation failure reason · 05ddf5f8

Leon Romanovsky authored Jan 24, 2023

Rely on extack to return failure reason.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

05ddf5f8

netdevsim: Fill IPsec state validation failure reason · 6c486979

Leon Romanovsky authored Jan 24, 2023

Rely on extack to return failure reason.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6c486979

net/mlx5e: Fill IPsec state validation failure reason · 902812b8

Leon Romanovsky authored Jan 24, 2023

Rely on extack to return failure reason.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

902812b8

xfrm: extend add state callback to set failure reason · 7681a4f5

Leon Romanovsky authored Jan 24, 2023

Almost all validation logic is in the drivers, but they are
missing reliable way to convey failure reason to userspace
applications.

Let's use extack to return this information to users.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7681a4f5

net/mlx5e: Fill IPsec policy validation failure reason · 1bb70c5a

Leon Romanovsky authored Jan 24, 2023

Rely on extack to return failure reason.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1bb70c5a

xfrm: extend add policy callback to set failure reason · 3089386d

Leon Romanovsky authored Jan 24, 2023

Almost all validation logic is in the drivers, but they are
missing reliable way to convey failure reason to userspace
applications.

Let's use extack to return this information to users.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3089386d

26 Jan, 2023 24 commits

net: ethtool: provide shims for stats aggregation helpers when CONFIG_ETHTOOL_NETLINK=n · 9179f5fe

Vladimir Oltean authored Jan 25, 2023

ethtool_aggregate_*_stats() are implemented in net/ethtool/stats.c, a
file which is compiled out when CONFIG_ETHTOOL_NETLINK=n. In order to
avoid adding Kbuild dependencies from drivers (which call these helpers)
on CONFIG_ETHTOOL_NETLINK, let's add some shim definitions which simply
make the helpers dead code.

This means the function prototypes should have been located in
include/linux/ethtool_netlink.h rather than include/linux/ethtool.h.

Fixes: 449c5459 ("net: ethtool: add helpers for aggregate statistics")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230125110214.4127759-1-vladimir.oltean@nxp.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

9179f5fe

Merge branch 'mptcp-add-mixed-v4-v6-support-for-the-in-kernel-pm' · 97f7d3dd

Paolo Abeni authored Jan 26, 2023

Matthieu Baerts says:

====================
mptcp: add mixed v4/v6 support for the in-kernel PM

Before these patches, the in-kernel Path-Manager would not allow, for
the same MPTCP connection, having a mix of subflows in v4 and v6.

MPTCP's RFC 8684 doesn't forbid that and it is even recommended to do so
as the path in v4 and v6 are likely different. Some networks are also
v4 or v6 only, we cannot assume they all have both v4 and v6 support.

Patch 1 then removes this artificial constraint in the in-kernel PM
currently enforcing there are no mixed subflows in place, either in
address announcement or in subflow creation areas.

Patch 2 makes sure the sk_ipv6only attribute is also propagated to
subflows, just in case a new PM wouldn't respect it.

Some selftests have also been added for the in-kernel PM (patch 3).

Patches 4 to 8 are just some cleanups and small improvements in the
printed messages in the userspace PM. It is not linked to the rest but
identified when working on a related patch modifying this selftest,
already in -net:

  commit 4656d72c ("selftests: mptcp: userspace: validate v4-v6 subflows mix")
---
====================

Link: https://lore.kernel.org/r/20230123-upstream-net-next-pm-v4-v6-v1-0-43fac502bfbf@tessares.netSigned-off-by: Paolo Abeni <pabeni@redhat.com>

97f7d3dd

selftests: mptcp: userspace: avoid read errors · 8dbdf24f

Matthieu Baerts authored Jan 25, 2023

During the cleanup phase, the server pids were killed with a SIGTERM
directly, not using a SIGUSR1 first to quit safely. As a result, this
test was often ending with two error messages:

  read: Connection reset by peer

While at it, use a for-loop to terminate all the PIDs the same way.

Also the different files are now removed after having killed the PIDs
using them. It makes more sense to do that in this order.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

8dbdf24f

selftests: mptcp: userspace: print error details if any · 10d42734

Matthieu Baerts authored Jan 25, 2023

Before, only '[FAIL]' was printed in case of error during the validation
phase.

Now, in case of failure, the variable name, its value and expected one
are displayed to help understand what was wrong.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

10d42734

selftests: mptcp: userspace: refactor asserts · 1c0b0ee2

Matthieu Baerts authored Jan 25, 2023

Instead of having a long list of conditions to check, it is possible to
give a list of variable names to compare with their 'e_XXX' version.

This will ease the introduction of the following commit which will print
which condition has failed (if any).
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

1c0b0ee2

selftests: mptcp: userspace: print titles · f790ae03

Matthieu Baerts authored Jan 25, 2023

This script is running a few tests after having setup the environment.

Printing titles helps understand what is being tested.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

f790ae03

mptcp: userspace pm: use a single point of exit · 40c71f76

Matthieu Baerts authored Jan 25, 2023

Like in all other functions in this file, a single point of exit is used
when extra operations are needed: unlock, decrement refcount, etc.

There is no functional change for the moment but it is better to do the
same here to make sure all cleanups are done in case of intermediate
errors.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

40c71f76

selftests: mptcp: add test-cases for mixed v4/v6 subflows · ad349374

Paolo Abeni authored Jan 25, 2023

Note that we can't guess the listener family anymore based on the client
target address: always use IPv6.

The fullmesh flag with endpoints from different families is also
validated here.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ad349374

mptcp: propagate sk_ipv6only to subflows · 7e9740e0

Matthieu Baerts authored Jan 25, 2023

Usually, attributes are propagated to subflows as well.

Here, if subflows are created by other ways than the MPTCP path-manager,
it is important to make sure they are in v6 if it is asked by the
userspace.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

7e9740e0

mptcp: let the in-kernel PM use mixed IPv4 and IPv6 addresses · b9d69db8

Paolo Abeni authored Jan 25, 2023

Currently the in-kernel PM arbitrary enforces that created subflow's
family must match the main MPTCP socket while the RFC allows mixing
IPv4 and IPv6 subflows.

This patch changes the in-kernel PM logic to create subflows matching
the currently selected source (or destination) address. IPv4 sockets
can pick only IPv4 addresses (and v4 mapped in v6), while IPv6 sockets
not restricted to V6ONLY can pick either IPv4 and IPv6 addresses as
long as the source and destination matches.

A helper, previously introduced is used to ease family matching checks,
taking care of IPv4 vs IPv4-mapped-IPv6 vs IPv6 only addresses.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/269Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

b9d69db8

icmp: Add counters for rate limits · d0941130

Jamie Bainbridge authored Jan 25, 2023

There are multiple ICMP rate limiting mechanisms:

* Global limits: net.ipv4.icmp_msgs_burst/icmp_msgs_per_sec
* v4 per-host limits: net.ipv4.icmp_ratelimit/ratemask
* v6 per-host limits: net.ipv6.icmp_ratelimit/ratemask

However, when ICMP output is limited, there is no way to tell
which limit has been hit or even if the limits are responsible
for the lack of ICMP output.

Add counters for each of the cases above. As we are within
local_bh_disable(), use the __INC stats variant.

Example output:

 # nstat -sz "*RateLimit*"
 IcmpOutRateLimitGlobal          134                0.0
 IcmpOutRateLimitHost            770                0.0
 Icmp6OutRateLimitHost           84                 0.0
Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Suggested-by: Abhishek Rawal <rawal.abhishek92@gmail.com>
Link: https://lore.kernel.org/r/273b32241e6b7fdc5c609e6f5ebc68caf3994342.1674605770.git.jamie.bainbridge@gmail.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

d0941130

Merge branch 'adding-sparx5-is0-vcap-support' · 9f927527

Paolo Abeni authored Jan 26, 2023

Steen Hegelund says:

====================
Adding Sparx5 IS0 VCAP support

This provides the Ingress Stage 0 (IS0) VCAP (Versatile Content-Aware
Processor) support for the Sparx5 platform.

The IS0 VCAP (also known in the datasheet as CLM) is a classifier VCAP that
mainly extracts frame information to metadata that follows the frame in the
Sparx5 processing flow all the way to the egress port.

The IS0 VCAP has 4 lookups and they are accessible with a TC chain id:

- chain 1000000: IS0 Lookup 0
- chain 1100000: IS0 Lookup 1
- chain 1200000: IS0 Lookup 2
- chain 1300000: IS0 Lookup 3
- chain 1400000: IS0 Lookup 4
- chain 1500000: IS0 Lookup 5

Each of these lookups have their own port keyset configuration that decides
which keys will be used for matching on which traffic type.

The IS0 VCAP has these traffic classifications:

- IPv4 frames
- IPv6 frames
- Unicast MPLS frames (ethertype = 0x8847)
- Multicast MPLS frames (ethertype = 0x8847)
- Other frame types than MPLS, IPv4 and IPv6

The IS0 VCAP has an action that allows setting the value of a PAG (Policy
Association Group) key field in the frame metadata, and this can be used
for matching in an IS2 VCAP rule.

This allow rules in the IS0 VCAP to be linked to rules in the IS2 VCAP.

The linking is exposed by using the TC "goto chain" action with an offset
from the IS2 chain ids.

As an example a "goto chain 8000001" will use a PAG value of 1 to chain to
a rule in IS2 Lookup 0.
====================

Link: https://lore.kernel.org/r/20230124104511.293938-1-steen.hegelund@microchip.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

9f927527

net: microchip: sparx5: Add support for IS0 VCAP CVLAN TC keys · 52df82cc

Steen Hegelund authored Jan 24, 2023

This adds support for parsing and matching on the CVLAN tags in the Sparx5
IS0 VCAP.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

52df82cc

net: microchip: sparx5: Add support for IS0 VCAP ethernet protocol types · 63e35645

Steen Hegelund authored Jan 24, 2023

This allows the IS0 VCAP to have its own list of supported ethernet
protocol types matching what is supported by the VCAPs port lookup
classification.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

63e35645

net: microchip: sparx5: Add automatic selection of VCAP rule actionset · 81e164c4

Steen Hegelund authored Jan 24, 2023

With more than one possible actionset in a VCAP instance, the VCAP API will
now use the actions in a VCAP rule to select the actionset that fits these
actions the best possible way.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

81e164c4

net: microchip: sparx5: Add TC filter chaining support for IS0 and IS2 VCAPs · 88bd9ea7

Steen Hegelund authored Jan 24, 2023

This allows rules to be chained between VCAP instances, e.g. from IS0
Lookup 0 to IS0 Lookup 1, or from one of the IS0 Lookups to one of the IS2
Lookups.

Chaining from an IS2 Lookup to another IS2 Lookup is not supported in the
hardware.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

88bd9ea7

net: microchip: sparx5: Add TC support for IS0 VCAP · 542e6e2c

Steen Hegelund authored Jan 24, 2023

This enables the TC command to use the Sparx5 IS0 VCAP
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

542e6e2c

net: microchip: sparx5: Add actionset type id information to rule · 7306fcd1

Steen Hegelund authored Jan 24, 2023

This adds the actionset type id to the rule information.  This is needed as
we now have more than one actionset in a VCAP instance (IS0).
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

7306fcd1

net: microchip: sparx5: Add IS0 VCAP keyset configuration for Sparx5 · 545609fd

Steen Hegelund authored Jan 24, 2023

This adds the IS0 VCAP port keyset configuration for Sparx5 and also
updates the debugFS support to show the keyset configuration.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

545609fd

net: microchip: sparx5: Add IS0 VCAP model and updated KUNIT VCAP model · f274a659

Steen Hegelund authored Jan 24, 2023

This provides the IS0 (Ingress Stage 0) or CLM VCAP model for Sparx5.
This VCAP provides classification actions for Sparx5.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

f274a659

Merge branch 'add-ip_local_port_range-socket-option' · 3f17e16f

Jakub Kicinski authored Jan 25, 2023

Jakub Sitnicki says:

====================
Add IP_LOCAL_PORT_RANGE socket option

This patch set is a follow up to the "How to share IPv4 addresses by
partitioning the port space" talk given at LPC 2022 [1].

Please see patch #1 for the motivation & the use case description.
Patch #2 adds tests exercising the new option in various scenarios.

Documentation
-------------

Proposed update to the ip(7) man-page:

       IP_LOCAL_PORT_RANGE (since Linux X.Y)
              Set or get the per-socket default local  port  range.  This
              option  can  be  used  to  clamp down the global local port
              range, defined by the ip_local_port_range  /proc  interface
              described below, for a given socket.

              The  option  takes  an uint32_t value with the high 16 bits
              set to the upper range bound, and the low 16  bits  set  to
              the  lower  range  bound.  Range  bounds are inclusive. The
              16-bit values should be in host byte order.

              The lower bound has to be less than the  upper  bound  when
              both  bounds  are  not  zero. Otherwise, setting the option
              fails with EINVAL.

              If either bound is outside of the global local port  range,
              or is zero, then that bound has no effect.

              To  reset  the setting, pass zero as both the upper and the
              lower bound.

Interaction with SELinux bind() hook
------------------------------------

SELinux bind() hook - selinux_socket_bind() - performs a permission check
if the requested local port number lies outside of the netns ephemeral port
range.

The proposed socket option cannot be used change the ephemeral port range
to extend beyond the per-netns port range, as set by
net.ipv4.ip_local_port_range.

Hence, there is no interaction with SELinux, AFAICT.

RFC -> v1
RFC: https://lore.kernel.org/netdev/20220912225308.93659-1-jakub@cloudflare.com/

 * Allow either the high bound or the low bound, or both, to be zero
 * Add getsockopt support
 * Add selftests

Links:
------

[1]: https://lpc.events/event/16/contributions/1349/
====================

Link: https://lore.kernel.org/r/20221221-sockopt-port-range-v6-0-be255cc0e51f@cloudflare.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3f17e16f

selftests/net: Cover the IP_LOCAL_PORT_RANGE socket option · ae543965

Jakub Sitnicki authored Jan 24, 2023

Exercise IP_LOCAL_PORT_RANGE socket option in various scenarios:

1. pass invalid values to setsockopt
2. pass a range outside of the per-netns port range
3. configure a single-port range
4. exhaust a configured multi-port range
5. check interaction with late-bind (IP_BIND_ADDRESS_NO_PORT)
6. set then get the per-socket port range
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ae543965

inet: Add IP_LOCAL_PORT_RANGE socket option · 91d0b78c

Jakub Sitnicki authored Jan 24, 2023

Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
   and the destination port.

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
   the socket.

   This approach has a couple of downsides:

   a) Search for a free port has to be implemented in the user-space. If
      the chosen 4-tuple happens to be busy, the application needs to retry
      from a different local port number.

      Detecting if 4-tuple is busy can be either easy (TCP) or hard
      (UDP). In TCP case, the application simply has to check if connect()
      returned an error (EADDRNOTAVAIL). That is assuming that the local
      port sharing was enabled (REUSEADDR) by all the sockets.

        # Assume desired local port range is 60_000-60_511
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s.bind(("192.0.2.1", 60_000))
        s.connect(("1.1.1.1", 53))
        # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
        # Application must retry with another local port

      In case of UDP, the network stack allows binding more than one socket
      to the same 4-tuple, when local port sharing is enabled
      (REUSEADDR). Hence detecting the conflict is much harder and involves
      querying sock_diag and toggling the REUSEADDR flag [1].

   b) For TCP, bind()-ing to a port within the ephemeral port range means
      that no connecting sockets, that is those which leave it to the
      network stack to find a free local port at connect() time, can use
      the this port.

      IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
      will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the use the per-netns
   ip_local_port_range sysctl to adjust the ephemeral port range bounds.

   The per-netns setting affects all sockets, so this approach can be used
   only if:

   - there is just one egress IP address, or
   - the desired egress port range is the same for all egress IP addresses
     used by the application.

   For TCP, this approach avoids the downsides of (1). Free port search and
   4-tuple conflict detection is done by the network stack:

     system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

     s = socket(AF_INET, SOCK_STREAM)
     s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
     s.bind(("192.0.2.1", 0))
     s.connect(("1.1.1.1", 53))
     # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

  For UDP this approach has limited applicability. Setting the
  IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
  port being shared with other connected UDP sockets.

  Hence relying on the network stack to find a free source port, limits the
  number of outgoing UDP flows from a single IP address down to the number
  of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

  PORT_LO = 40_000
  PORT_HI = 40_511

  s = socket(AF_INET, SOCK_STREAM)
  v = struct.pack("I", PORT_HI << 16 | PORT_LO)
  s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
  s.bind(("127.0.0.1", 0))
  s.getsockname()
  # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
  # if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

91d0b78c

net: Kconfig: fix spellos · 6a7a2c18

Randy Dunlap authored Jan 24, 2023

Fix spelling in net/ Kconfig files.
(reported by codespell)
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: coreteam@netfilter.org
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Link: https://lore.kernel.org/r/20230124181724.18166-1-rdunlap@infradead.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6a7a2c18