Commits · 2893ba9c1d1a4f62b03db77aa8cbc00464fe41c5 · Kirill Smelkov / linux

03 Aug, 2023 40 commits

selftests: openvswitch: add basic ct test case parsing · 2893ba9c

Aaron Conole authored Aug 01, 2023

Forwarding via ct() action is an important use case for openvswitch, but
generally would require using a full ovs-vswitchd to get working. Add a
ct action parser for basic ct test case.
Signed-off-by: Aaron Conole <aconole@redhat.com>
Reviewed-by: Adrian Moreno <amorenoz@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2893ba9c

selftests: openvswitch: add a test for ipv4 forwarding · 05398aa4

Aaron Conole authored Aug 01, 2023

This is a simple ipv4 bidirectional connectivity test.
Signed-off-by: Aaron Conole <aconole@redhat.com>
Reviewed-by: Adrian Moreno <amorenoz@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

05398aa4

selftests: openvswitch: support key masks · 9f1179fb

Adrian Moreno authored Aug 01, 2023

The default value for the mask actually depends on the value (e.g: if
the value is non-null, the default is full-mask), so change the convert
functions to accept the full, possibly masked string and let them figure
out how to parse the different values.

Also, implement size-aware int parsing.
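
The default-mask rule and size-aware integer parsing described above can be sketched roughly as follows. This is an illustrative model only, not the selftest's actual code: the function names `parse_int_masked` and `parse_ipv4_masked` are hypothetical, and treating a zero value as "mask 0" is an assumption drawn from the wording "if the value is non-null, the default is full-mask".

```python
import ipaddress


def parse_int_masked(text, nbits):
    """Parse 'value[/mask]' for an integer field that is nbits wide."""
    full = (1 << nbits) - 1
    value_str, _, mask_str = text.partition("/")
    value = int(value_str, 0)  # base 0 accepts both "8080" and "0x1f90"
    if mask_str:
        mask = int(mask_str, 0)
    else:
        # Assumption: a non-zero value defaults to a full mask, zero to no mask.
        mask = full if value else 0
    if not 0 <= value <= full or not 0 <= mask <= full:
        raise ValueError(f"{text!r} does not fit in {nbits} bits")
    return value, mask


def parse_ipv4_masked(text):
    """Parse 'addr', 'addr/prefixlen' or 'addr/dotted.mask'."""
    addr_str, _, mask_str = text.partition("/")
    addr = ipaddress.IPv4Address(addr_str)
    if not mask_str:
        mask = ipaddress.IPv4Address(0xFFFFFFFF if int(addr) else 0)
    elif mask_str.isdigit():
        prefix = int(mask_str)
        if prefix > 32:
            raise ValueError(f"bad prefix length in {text!r}")
        mask = ipaddress.IPv4Address((0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF)
    else:
        mask = ipaddress.IPv4Address(mask_str)
    return str(addr), str(mask)


if __name__ == "__main__":
    print(parse_int_masked("8080", 16))
    print(parse_int_masked("8080/0xf0f0", 16))
    print(parse_ipv4_masked("192.168.1.1/24"))
```

Passing the field width explicitly is what makes the full-mask default well defined: a 16-bit TCP port and a 32-bit field get different all-ones masks from the same parser.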

With this patch we can now express flows such as the following:
"eth(src=0a:ca:fe:ca:fe:0a/ff:ff:00:00:ff:00)"
"eth(src=0a:ca:fe:ca:fe:0a)" -> mask = ff:ff:ff:ff:ff:ff
"ipv4(src=192.168.1.1)" -> mask = 255.255.255.255
"ipv4(src=192.168.1.1/24)"
"ipv4(src=192.168.1.1/255.255.255.0)"
"tcp(src=8080)" -> mask = 0xffff
"tcp(src=8080/0xf0f0)"
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

9f1179fb

selftests: openvswitch: add an initial flow programming case · 918423fd

Aaron Conole authored Aug 01, 2023

The openvswitch self-tests can test much of the control side of
the module (ie: what a vswitchd implementation would process),
but the actual packet forwarding cases aren't supported, making
the testing of limited value.

Add some flow parsing and an initial ARP based test case using
arping utility.  This lets us display flows, add some basic
output flows with simple matches, and test against a known good
forwarding case.
Signed-off-by: Aaron Conole <aconole@redhat.com>
Reviewed-by: Adrian Moreno <amorenoz@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

918423fd

udp6: Fix __ip6_append_data()'s handling of MSG_SPLICE_PAGES · ce650a16

David Howells authored Aug 02, 2023

__ip6_append_data() can have a similar problem to __ip_append_data()[1] when
asked to splice into a partially-built UDP message that has more than the
frag-limit data and up to the MTU limit, but in the ipv6 case, it errors
out with EINVAL.  This can be triggered with something like:

        pipe(pfd);
        sfd = socket(AF_INET6, SOCK_DGRAM, 0);
        connect(sfd, ...);
        send(sfd, buffer, 8137, MSG_CONFIRM|MSG_MORE);
        write(pfd[1], buffer, 8);
        splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0);

where the amount of data given to send() is dependent on the MTU size (in
this instance an interface with an MTU of 8192).

The problem is that the calculation of the amount to copy in
__ip6_append_data() goes negative in two places, but a check has been put
in to give an error in this case.

This happens because when pagedlen > 0 (which happens for MSG_ZEROCOPY and
MSG_SPLICE_PAGES), the terms in:

        copy = datalen - transhdrlen - fraggap - pagedlen;

then mostly cancel when pagedlen is substituted for, leaving just -fraggap.
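
The cancellation can be checked with a small arithmetic model. This is only a sketch: it assumes that on the paged path pagedlen is derived as datalen - transhdrlen (the substitution the text refers to), and the function name `copy_len` and the sample numbers are made up for illustration.

```python
def copy_len(datalen, transhdrlen, fraggap, paged):
    """Model of: copy = datalen - transhdrlen - fraggap - pagedlen."""
    # Assumption: on the paged path, pagedlen = datalen - transhdrlen.
    pagedlen = datalen - transhdrlen if paged else 0
    return datalen - transhdrlen - fraggap - pagedlen


if __name__ == "__main__":
    # Illustrative values only; they are not taken from a real trace.
    print(copy_len(datalen=1000, transhdrlen=8, fraggap=24, paged=True))
    print(copy_len(datalen=1000, transhdrlen=8, fraggap=24, paged=False))
```

Under that assumption the paged result is always -fraggap, so any non-zero fraggap makes copy negative regardless of datalen.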

Fix this by:

 (1) Insert a note about the dodgy calculation of 'copy'.

 (2) If MSG_SPLICE_PAGES, clear copy if it is negative from the above
     equation, so that 'offset' isn't regressed and 'length' isn't
     increased, which will mean that length and thus copy should match the
     amount left in the iterator.

 (3) When handling MSG_SPLICE_PAGES, give a warning and return -EIO if
     we're asked to splice more than is in the iterator.  It might be
     better to not give the warning or even just give a 'short' write.

 (4) If MSG_SPLICE_PAGES, override the copy<0 check.

[!] Note that this should also affect MSG_ZEROCOPY, but that will return
-EINVAL for the range of send sizes that requires the skbuff to be split.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/r/000000000000881d0606004541d1@google.com/ [1]
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1580952.1690961810@warthog.procyon.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ce650a16

net: gemini: Do not check for 0 return after calling platform_get_irq() · 6abce66b

Ruan Jinjie authored Aug 02, 2023

It is not possible for platform_get_irq() to return 0. Use the
return value from platform_get_irq().
Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/20230802085216.659238-1-ruanjinjie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

6abce66b

drivers: net: xgene: Do not check for 0 return after calling platform_get_irq() · c1e9e5e0

Ruan Jinjie authored Aug 02, 2023

It is not possible for platform_get_irq() to return 0. Use the
return value from platform_get_irq().
Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/20230802090657.969923-1-ruanjinjie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

c1e9e5e0

tipc: Remove unused function declarations · c956910d

Yue Haibing authored Aug 02, 2023

Commit d50ccc2d ("tipc: add 128-bit node identifier") declared but never
implemented tipc_node_id2hash().
Also commit 5c216e1d ("tipc: Allow run-time alteration of default link settings")
never implemented tipc_media_set_priority() and tipc_media_set_window(),
commit cad2929d ("tipc: update a binding service via broadcast") only declared
tipc_named_bcast().

Since commit be07f056 ("tipc: simplify the finalize work queue")
tipc_sched_net_finalize() is removed and declaration is unused.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230802034659.39840-1-yuehaibing@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

c956910d

net: ethernet: mtk_eth_soc: support per-flow accounting on MT7988 · 571e9c49

Daniel Golle authored Aug 02, 2023

NETSYS_V3 uses 64 bits for each counter while older SoCs use
48/40 bits for each counter.
Support reading per-flow byte and packet counters on NETSYS_V3.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/37a0928fa8c1253b197884c68ce1f54239421ac5.1690946442.git.daniel@makrotopia.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

571e9c49

bonding: support balance-alb with openvswitch · f11e5bd1

Mateusz Kowalski authored Aug 01, 2023

Commit d5410ac7 ("net:bonding:support balance-alb interface with
vlan to bridge") introduced support for balance-alb mode for
interfaces connected to the linux bridge by fixing missing matching of
MAC entry in FDB. In our testing we discovered that it still does not
work when the bond is connected to the OVS bridge as shown in the diagram
below:

eth1(mac:eth1_mac)--bond0(balance-alb,mac:eth0_mac)--eth0(mac:eth0_mac)
                         |
                       bond0.150(mac:eth0_mac)
                         |
                       ovs_bridge(ip:bridge_ip,mac:eth0_mac)

This patch fixes it by checking not only if the device is a bridge but
also if it is an openvswitch.
Signed-off-by: Mateusz Kowalski <mko@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/9fe7297c-609e-208b-c77b-3ceef6eb51a4@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

f11e5bd1

Merge branch 'introduce-ndo_hwtstamp_get-and-ndo_hwtstamp_set' · b23ec2bd

Jakub Kicinski authored Aug 02, 2023

Vladimir Oltean says:

====================
Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set()

Based on previous RFCs from Maxim Georgiev:
https://lore.kernel.org/netdev/20230502043150.17097-1-glipus@gmail.com/

this series attempts to introduce new API for the hardware timestamping
control path (SIOCGHWTSTAMP and SIOCSHWTSTAMP handling).

I don't have any board with phylib hardware timestamping, so I would
appreciate testing (especially on lan966x, the most intricate
conversion). I was, however, able to test netdev level timestamping,
because I also have some more unsubmitted conversions in progress:

https://github.com/vladimiroltean/linux/commits/ndo-hwtstamp-v9

I hope that the concerns expressed in the comments of previous series
were addressed, and that Köry Maincent's series:
https://lore.kernel.org/netdev/20230406173308.401924-1-kory.maincent@bootlin.com/
can make progress in parallel with the conversion of the rest of drivers.
====================

Link: https://lore.kernel.org/r/20230801142824.1772134-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

b23ec2bd

net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted drivers · fd770e85

Vladimir Oltean authored Aug 01, 2023

It is desirable that the new .ndo_hwtstamp_set() API gives more
uniformity, less overhead and future flexibility w.r.t. the PHY
timestamping behavior.

Currently there are some drivers which allow PHY timestamping through
the procedure mentioned in Documentation/networking/timestamping.rst.
They don't do anything locally if phy_has_hwtstamp() is set, except for
lan966x which installs PTP packet traps.

Centralize that behavior in a new dev_set_hwtstamp_phylib() code
function, which calls either phy_mii_ioctl() for the phylib PHY,
or .ndo_hwtstamp_set() of the netdev, based on a single policy
(currently simplistic: phy_has_hwtstamp()).

Any driver converted to .ndo_hwtstamp_set() will automatically opt into
the centralized phylib timestamping policy. Unconverted drivers still
get to choose whether they let the PHY handle timestamping or not.

Netdev drivers with integrated PHY drivers that don't use phylib
presumably don't set dev->phydev, and those will always see
HWTSTAMP_SOURCE_NETDEV requests even when converted. The timestamping
policy will remain 100% up to them.
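
The centralized policy described above can be modelled in a few lines. This is a conceptual sketch in Python, not kernel code: the `Phy` and `NetDev` classes and the `set_hwtstamp` function are hypothetical stand-ins for the phylib PHY, the netdev and dev_set_hwtstamp_phylib().

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Phy:
    has_hwtstamp: bool                 # stands in for phy_has_hwtstamp()
    mii_ioctl: Callable[[dict], str]   # stands in for phy_mii_ioctl()


@dataclass
class NetDev:
    ndo_hwtstamp_set: Callable[[dict], str]
    phydev: Optional[Phy] = None       # None models a driver without phylib


def set_hwtstamp(dev: NetDev, cfg: dict) -> str:
    """Route the request to the PHY if it can timestamp, else to the netdev."""
    phy = dev.phydev
    if phy is not None and phy.has_hwtstamp:
        return phy.mii_ioctl(cfg)
    return dev.ndo_hwtstamp_set(cfg)


if __name__ == "__main__":
    mac_handler = lambda cfg: "netdev"
    phy_handler = lambda cfg: "phy"
    print(set_hwtstamp(NetDev(mac_handler, Phy(True, phy_handler)), {}))
    print(set_hwtstamp(NetDev(mac_handler), {}))
```

A device with no phydev always falls through to its own handler, which matches the statement that such drivers keep full control of the timestamping policy.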
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-13-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fd770e85

net: phy: provide phylib stubs for hardware timestamping operations · 60495b66

Vladimir Oltean authored Aug 01, 2023

net/core/dev_ioctl.c (built-in code) will want to call phy_mii_ioctl()
for hardware timestamping purposes. This is not directly possible,
because phy_mii_ioctl() is a symbol provided under CONFIG_PHYLIB.

Do something similar to what was done in DSA in commit 5a178186
("net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stub"),
and arrange some indirect calls to phy_mii_ioctl() through a stub
structure containing function pointers, that's provided by phylib as
built-in even when CONFIG_PHYLIB=m, and which phy_init() populates at
runtime (module insertion).

Note: maybe the ownership of the ethtool_phy_ops singleton is backwards,
and the methods exposed by that should be later merged into phylib_stubs.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-12-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

60495b66

net: transfer rtnl_lock() requirement from ethtool_set_ethtool_phy_ops() to caller · 70ef7d87

Vladimir Oltean authored Aug 01, 2023

phy_init() and phy_exit() will have to do more stuff under rtnl_lock()
in a future change. Since rtnl_unlock() -> netdev_run_todo() does a lot
of stuff under the hood, it's a pity to lock and unlock the rtnetlink
mutex twice in a row.

Change the calling convention such that the only caller of
ethtool_set_ethtool_phy_ops(), phy_device.c, provides a context where
the rtnl_mutex is already acquired.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-11-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

70ef7d87

net: lan966x: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set() · 54e1ed69

Vladimir Oltean authored Aug 01, 2023

The hardware timestamping through ndo_eth_ioctl() is going away.
Convert the lan966x driver to the new API before that can be removed.

After removing the timestamping logic from lan966x_port_ioctl(), the
rest is equivalent to phy_do_ioctl().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-10-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

54e1ed69

net: sparx5: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set() · 7bdde444

Vladimir Oltean authored Aug 01, 2023

The hardware timestamping through ndo_eth_ioctl() is going away.
Convert the sparx5 driver to the new API before that can be removed.

After removing the timestamping logic from sparx5_port_ioctl(), the rest
is equivalent to phy_do_ioctl().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-9-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7bdde444

net: fec: delete fec_ptp_disable_hwts() · 547b006d

Vladimir Oltean authored Aug 01, 2023

Commit 34074639 ("net: fec: fix hardware time stamping by external
devices") was overly cautious with calling fec_ptp_disable_hwts() when
cmd == SIOCSHWTSTAMP and use_fec_hwts == false, because use_fec_hwts is
based on a runtime invariant (phy_has_hwtstamp()). Thus, if use_fec_hwts
is false, then fep->hwts_tx_en and fep->hwts_rx_en cannot be changed at
runtime; their values depend on the initial memory allocation, which
already sets them to zeroes.

If the core will ever gain support for switching timestamping layers,
it will arrange for a more organized calling convention and disable
timestamping in the previous layer as a first step. This means that the
code in the FEC driver is not necessary in any case.

The purpose of this change is to arrange the phy_has_hwtstamp() code in
a way in which it can be refactored away into generic logic.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-8-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

547b006d

net: fec: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set() · ef5eb9c5

Vladimir Oltean authored Aug 01, 2023

The hardware timestamping through ndo_eth_ioctl() is going away.
Convert the FEC driver to the new API before that can be removed.

After removing the timestamping logic from fec_enet_ioctl(), the rest
is equivalent to phy_do_ioctl_running().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-7-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ef5eb9c5

net: bonding: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set() · c0dabeb4

Maxim Georgiev authored Aug 01, 2023

bonding is one of the stackable net devices which pass the hardware
timestamping ops to the real device through ndo_eth_ioctl(). This
prevents converting any device driver to the new hwtimestamping API
without regressions.

Remove that limitation in bonding by using the newly introduced helpers
for timestamping through lower devices, that handle both the new and the
old driver API.
Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-6-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c0dabeb4

net: macvlan: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set() · 0bca3f7f

Maxim Georgiev authored Aug 01, 2023

macvlan is one of the stackable net devices which pass the hardware
timestamping ops to the real device through ndo_eth_ioctl(). This
prevents converting any device driver to the new hwtimestamping API
without regressions.

Remove that limitation in macvlan by using the newly introduced helpers
for timestamping through lower devices, that handle both the new and the
old driver API.

macvlan only implements ndo_eth_ioctl() for these 2 operations, so
delete that method.
Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-5-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0bca3f7f

net: vlan: convert to ndo_hwtstamp_get() / ndo_hwtstamp_set() · 65c9fde1

Maxim Georgiev authored Aug 01, 2023

8021q is one of the stackable net devices which pass the hardware
timestamping ops to the real device through ndo_eth_ioctl(). This
prevents converting any device driver to the new hwtimestamping API
without regressions.

Remove that limitation in the vlan driver by using the newly introduced
helpers for timestamping through lower devices, that handle both the new
and the old driver API.
Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

65c9fde1

net: add hwtstamping helpers for stackable net devices · e47d01fe

Maxim Georgiev authored Aug 01, 2023

The stackable net devices with hwtstamping support (vlan, macvlan,
bonding) only pass the hwtstamping ops to the lower (real) device.

These drivers are the first that need to be converted to the new
timestamping API, because if they aren't prepared to handle that,
then no real device driver can be converted to the new API either.

After studying what vlan_dev_ioctl(), macvlan_eth_ioctl() and
bond_eth_ioctl() have in common, here we propose two generic
implementations of ndo_hwtstamp_get() and ndo_hwtstamp_set() which
can be called by those 3 drivers, with "dev" being their lower device.

These helpers cover both cases, when the lower driver is converted to
the new API or unconverted.

We need some hacks in case of an unconverted driver, namely to stuff
some pointers in struct kernel_hwtstamp_config which shouldn't have
been there (since the new API isn't supposed to need it). These will
be removed when all drivers will have been converted to the new API.
Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e47d01fe

net: add NDOs for configuring hardware timestamping · 66f72230

Maxim Georgiev authored Aug 01, 2023

Current hardware timestamping API for NICs requires implementing
.ndo_eth_ioctl() for SIOCGHWTSTAMP and SIOCSHWTSTAMP.

That API has some boilerplate such as request parameter translation
between user and kernel address spaces, handling possible translation
failures correctly, etc. Since it is the same all across the board, it
would be desirable to handle it through generic code.

Here we introduce .ndo_hwtstamp_get() and .ndo_hwtstamp_set(), which
implement that boilerplate and allow drivers to just act upon requests.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

66f72230

Merge branch 'net-extend-alloc_skb_with_frags-max-size' · 72c1a284

Jakub Kicinski authored Aug 02, 2023

Eric Dumazet says:

====================
net: extend alloc_skb_with_frags() max size

alloc_skb_with_frags(), while being able to use high order allocations,
limits the payload size to PAGE_SIZE * MAX_SKB_FRAGS

Reviewing Tahsin Erdogan patch [1], it was clear to me we need
to remove this limitation.

[1] https://lore.kernel.org/netdev/20230731230736.109216-1-trdgn@amazon.com/

v2: Addressed Willem feedback on 1st patch.
====================

Link: https://lore.kernel.org/r/20230801205254.400094-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

72c1a284

net: tap: change tap_alloc_skb() to allow bigger paged allocations · 37dfe5b8

Eric Dumazet authored Aug 01, 2023

tap_alloc_skb() is currently calling sock_alloc_send_pskb()
forcing order-0 page allocations.

Switch to PAGE_ALLOC_COSTLY_ORDER, to increase max size by 8x.

Also add logic to increase the linear part if needed.
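
The 8x figure follows from the order itself, as this small calculation shows. It is only an illustration: PAGE_ALLOC_COSTLY_ORDER is 3 in the kernel, while the 4 KiB page size and the MAX_SKB_FRAGS value of 17 are assumptions about a typical configuration, and `max_paged_bytes` is a hypothetical helper name.

```python
PAGE_ALLOC_COSTLY_ORDER = 3  # kernel constant
PAGE_SIZE = 4096             # assumption: 4 KiB pages
MAX_SKB_FRAGS = 17           # assumption: default number of frags per skb


def max_paged_bytes(order, page_size=PAGE_SIZE, max_frags=MAX_SKB_FRAGS):
    """Upper bound on paged payload when each frag may span 2**order pages."""
    return max_frags * (page_size << order)


if __name__ == "__main__":
    order0 = max_paged_bytes(0)
    costly = max_paged_bytes(PAGE_ALLOC_COSTLY_ORDER)
    print(order0, costly, costly // order0)
```

With order-0 pages the bound is PAGE_SIZE * MAX_SKB_FRAGS; raising the maximum order to 3 multiplies each frag's capacity by 2**3.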
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tahsin Erdogan <trdgn@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230801205254.400094-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

37dfe5b8

net/packet: change packet_alloc_skb() to allow bigger paged allocations · ae6db08f

Eric Dumazet authored Aug 01, 2023

packet_alloc_skb() is currently calling sock_alloc_send_pskb()
forcing order-0 page allocations.

Switch to PAGE_ALLOC_COSTLY_ORDER, to increase max size by 8x.

Also add logic to increase the linear part if needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tahsin Erdogan <trdgn@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230801205254.400094-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ae6db08f

net: tun: change tun_alloc_skb() to allow bigger paged allocations · ce7c7fef

Eric Dumazet authored Aug 01, 2023

tun_alloc_skb() is currently calling sock_alloc_send_pskb()
forcing order-0 page allocations.

Switch to PAGE_ALLOC_COSTLY_ORDER, to increase max allocation size by 8x.

Also add logic to increase the linear part if needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tahsin Erdogan <trdgn@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230801205254.400094-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ce7c7fef

net: allow alloc_skb_with_frags() to allocate bigger packets · 09c2c907

Eric Dumazet authored Aug 01, 2023

Refactor alloc_skb_with_frags() to allow bigger packets allocations.

Instead of assuming that only order-0 allocations will be attempted,
use the caller supplied max order.

v2: try harder to use high-order pages, per Willem feedback.

Link: https://lore.kernel.org/netdev/CANn89iJQfmc_KeUr3TeXvsLQwo3ZymyoCr7Y6AnHrkWSuz0yAg@mail.gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tahsin Erdogan <trdgn@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20230801205254.400094-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

09c2c907

sctp: Remove unused function declarations · 49c467dc

Yue Haibing authored Jul 31, 2023

These declarations have never been implemented since the beginning of git history.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/20230731141030.32772-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

49c467dc

Merge branch 'mlx5-ipsec-packet-offload-support-in-eswitch-mode' · edd8b295

Jakub Kicinski authored Aug 02, 2023

Leon Romanovsky says:

====================
mlx5 IPsec packet offload support in eswitch mode

This series from Jianbo adds mlx5 IPsec packet offload support in eswitch
offloaded mode.

It works exactly like "regular" IPsec, nothing special, except
now users can switch to switchdev before adding IPsec rules.

 devlink dev eswitch set pci/0000:06:00.0 mode switchdev

Same configurations as here:

https://lore.kernel.org/netdev/cover.1670005543.git.leonro@nvidia.com/

Packet offload mode:
  ip xfrm state offload packet dev <if-name> dir <in|out>
  ip xfrm policy .... offload packet dev <if-name>
Crypto offload mode:
  ip xfrm state offload crypto dev <if-name> dir <in|out>
or (backward compatibility)
  ip xfrm state offload dev <if-name> dir <in|out>

v0: https://lore.kernel.org/all/cover.1689064922.git.leonro@nvidia.com
====================

Link: https://lore.kernel.org/r/cover.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

edd8b295

net/mlx5e: Make TC and IPsec offloads mutually exclusive on a netdev · c8e350e6

Jianbo Liu authored Jul 31, 2023

For IPsec packet offload mode, the order of TC offload and IPsec
offload on the same netdevice is not aligned with the order in the
non-offload software. For example, for RX, the software performs TC
first and then IPsec transformation, but the implementation for
offload does that in the opposite way.

To resolve the difference for now, either IPsec offload or TC offload,
not both, is allowed for a specific interface.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/8e2e5e3b0984d785066e8663aaf97b3ba1bb873f.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c8e350e6

net/mlx5e: Add get IPsec offload stats for uplink representor · 6e56ab1c

Jianbo Liu authored Jul 31, 2023

As IPsec offload is supported in switchdev mode, HW stats can be
obtained from uplink rep.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/b43c91c452f1db9c35c10639a029aa10fd8b7895.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6e56ab1c

net/mlx5e: Modify and restore TC rules for IPSec TX rules · d1569537

Jianbo Liu authored Jul 31, 2023

After IPsec policy/state TX rules are added, any TC flow rule, which
forwards packets to uplink, is modified to forward to IPsec TX tables.
As these tables are destroyed dynamically, whenever there is no
reference to them, the destinations of this kind of rules must be
restored to uplink.

There is a special case for packet encapsulation, as the
packet_reformat_id in the extended destination is used to reformat
packets, but only for the VPORT destination. To forward packet to
IPsec table and do encapsulation in one FTE, move the
packet_reformat_id to flow context, instead of using the extended
destination. As a limitation, multiple encapsulations with table
forwarding, and one together with other VPORT destinations, are not
allowed, so add a check when offloading TC rules.

TC rules are not allowed before an IPsec TX rule is added, so we only need
to restore TC rules after flushing IPsec TX rules. As they are saved in
the vport_rep rhashtables, we walk all the rules in the rhashtables,
find TC rules with destinations pointing to IPsec tables, and
modify them one by one. To avoid concurrency issues, this handling is
done under the protection of eswitch mode_lock.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/7bcb2c7e2ecf0e0d06b095c8dcc6a37ea7f02faf.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d1569537

net/mlx5e: Make IPsec offload work together with eswitch and TC · 366e4624

Jianbo Liu authored Jul 31, 2023

The eswitch mode is not allowed to change if there are any IPsec rules.
Besides, by using mlx5_esw_try_lock() to get eswitch mode lock, IPsec
rules are not allowed to be offloaded if there are any TC rules.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/e442b512b21a931fbdfb87d57ae428c37badd58a.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

366e4624

net/mlx5: Compare with old_dest param to modify rule destination · 1632649d

Jianbo Liu authored Jul 31, 2023

The rule destination must be compared with the old_dest passed in.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/24adc60d05d7492359ba343c6da1ebbe9fe284f6.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1632649d

net/mlx5e: Support IPsec packet offload for TX in switchdev mode · c6c2bf5d

Jianbo Liu authored Jul 31, 2023

IPsec encryption is done last, so add a new prio for IPsec
offload in FDB, and put it just lower than the slow path prio and
higher than the per-vport prio.
Three levels are added for TX. The first one is for ip xfrm policy.
The SA table is created in the second level for ip xfrm state. The
status table is created last to count the number of packets
encrypted.
The rules, which forward packets to uplink, are changed to forward
them to IPsec TX tables first. These rules are restored after those
tables are destroyed, which is done immediately when there is no
reference to them, just as is done in legacy mode. Support for the
slow path is added here, by refreshing uplink's channels. However, the
handling for the TC fast path, which is more complicated, will be added
later. Besides, reg c4 is used instead to match reqid.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/cfd0e6ffaf0b8c55ebaa9fb0649b7c504b6b8ec6.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c6c2bf5d

net/mlx5e: Refactor IPsec TX tables creation · f46e92d6

Jianbo Liu authored Jul 31, 2023

Add attribute for IPsec TX creation, pass all needed parameters in it,
so tx_create() can be used by eswitch.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/24d5ab988b0db2d39b7fde321b44ffe885d47828.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f46e92d6

net/mlx5e: Handle IPsec offload for RX datapath in switchdev mode · 91bafc63

Jianbo Liu authored Jul 31, 2023

Reuse tun opts bits in reg c1, to pass IPsec obj id to datapath.
As this is only for RX SA and there are only 11 bits, xarray is used
to map IPsec obj id to an index, which is between 1 and 0x7ff, and
replace obj id to write to reg c1.
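
The idea of squeezing an arbitrary object id into an 11-bit field can be modelled as below. This is a plain-Python illustration, not the driver's code: the kernel uses an xarray, whereas this sketch uses a dict, and the class name `IndexMap` and its methods are hypothetical.

```python
class IndexMap:
    """Map arbitrary object ids to small indices in [1, 0x7ff]."""

    MIN_INDEX = 1
    MAX_INDEX = 0x7FF  # largest value that fits in 11 bits

    def __init__(self):
        self._by_index = {}

    def alloc(self, obj_id):
        """Reserve the lowest free index for obj_id and return it."""
        for index in range(self.MIN_INDEX, self.MAX_INDEX + 1):
            if index not in self._by_index:
                self._by_index[index] = obj_id
                return index
        raise RuntimeError("no free index left")

    def lookup(self, index):
        """Return the object id stored for index, or None."""
        return self._by_index.get(index)

    def release(self, index):
        """Free an index so it can be reused."""
        self._by_index.pop(index, None)


if __name__ == "__main__":
    m = IndexMap()
    idx = m.alloc(0x12345678)   # id too wide for 11 bits
    print(idx, hex(m.lookup(idx)))
```

Starting the range at 1 leaves 0 free to mean "no index present" in the register, which is one plausible reading of why the commit excludes it.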
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/43d60fbcc9cd672a97d7e2a2f7fe6a3d9e9a776d.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

91bafc63

net/mlx5e: Support IPsec packet offload for RX in switchdev mode · 1762f132

Jianbo Liu authored Jul 31, 2023

As decryption must be done first, add new prio for IPsec offload in
FDB, and put it just lower than BYPASS prio and higher than TC prio.
Three levels are added for RX. The first one is for ip xfrm policy. SA
table is created in the second level for ip xfrm state. The status
table is created last to check the decryption result. On
success, packets continue to the next stage; otherwise they are dropped.
For now, the setting of reg c1 is removed for switchdev mode, and the
datapath processing will be added in the next patch.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/c91063554cf643fb50b99cf093e8a9bf11729de5.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1762f132

net/mlx5e: Refactor IPsec RX tables creation and destruction · 6e125265

Jianbo Liu authored Jul 31, 2023

Add attribute for IPsec RX creation, so rx_create() can be used by
eswitch in later patch. And move the code for TTC dest
connect/disconnect, which are needed only in NIC mode, to individual
functions.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/87478d928479b6a4eee41901204546ea05741815.1690802064.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6e125265