22 Apr, 2024 (7 commits)
    • net: add callback for setting a ubuf_info to skb · 65bada80
      Pavel Begunkov authored
      At the moment an skb can only have one ubuf_info associated with it,
      which might be a performance problem for zerocopy sends in cases like
      TCP via io_uring. Add a callback for assigning a ubuf_info to an skb;
      this way we can later implement smarter assignment, such as linking
      ubuf_info structures together.
      
      Note that the callback is optional and should be compatible with
      skb_zcopy_set(), because the net stack might decide to clone an skb
      and take another reference to the ubuf_info whenever it wishes. Also,
      a correct implementation must always be able to bind to an skb that
      has no prior ubuf_info; otherwise the send would not be able to
      progress.
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
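      A minimal user-space model of the optional hook this describes; the
      member and function names are assumptions drawn from the commit
      message, not the exact upstream definitions:
      
          struct sk_buff;                  /* opaque in this model */
          struct ubuf_info;
          
          struct ubuf_info_ops {
                  /* Optional: bind uarg to skb.  Must also accept an skb
                   * with no prior ubuf_info so a send can always progress. */
                  int (*link_skb)(struct sk_buff *skb, struct ubuf_info *uarg);
          };
          
          struct ubuf_info {
                  const struct ubuf_info_ops *ops;
                  int refcnt;              /* stand-in for the real refcount */
          };
          
          /* skb_zcopy_set()-compatible dispatch: prefer the provider's
           * hook, otherwise fall back to take-a-reference-and-attach. */
          static int model_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
          {
                  if (uarg->ops && uarg->ops->link_skb)
                          return uarg->ops->link_skb(skb, uarg);
          
                  uarg->refcnt++;          /* net_zcopy_get() in the stack */
                  /* ... attach uarg to the skb's shared info here ... */
                  return 0;
          }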
    • net: extend ubuf_info callback to ops structure · 7ab4f16f
      Pavel Begunkov authored
      We'll need to associate additional callbacks with ubuf_info, so
      introduce a structure holding the ubuf_info callbacks. Apart from the
      smarter io_uring notification management introduced in the next
      patches, it can be used to generalise msg_zerocopy_put_abort() and
      also to store ->sg_from_iter, which is currently passed in struct
      msghdr.
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
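      As a sketch of how such an ops grouping can generalise
      msg_zerocopy_put_abort(): once the completion callback lives in a
      per-ubuf ops table, an abort is just a completion reported without
      success, whichever zerocopy provider created the uarg. The types and
      names below are modelling assumptions, not the kernel's:
      
          #include <stdbool.h>
          #include <stddef.h>
          
          struct sk_buff;
          struct ubuf_info;
          
          struct ubuf_info_ops {
                  /* Completion callback, previously a bare function
                   * pointer carried by ubuf_info itself. */
                  void (*complete)(struct sk_buff *skb, struct ubuf_info *uarg,
                                   bool success);
                  /* ...room for more hooks, e.g. an sg_from_iter-style
                   * helper instead of passing it via struct msghdr... */
          };
          
          struct ubuf_info {
                  const struct ubuf_info_ops *ops;
          };
          
          /* Generalised abort: complete with success == false. */
          static void model_zcopy_put_abort(struct ubuf_info *uarg)
          {
                  uarg->ops->complete(NULL, uarg, false);
          }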
    • Merge branch 'tcp-avoid-sending-too-small-packets' · 65f1df11
      Jakub Kicinski authored
      Eric Dumazet says:
      
      ====================
      tcp: avoid sending too small packets
      
      tcp_sendmsg() cooks 'large' skbs, which are later split
      if needed by tcp_write_xmit().
      
      After a split, the leftover skb is smaller than the optimal
      size, and this causes a performance drop.
      
      In this series, a tcp_grow_skb() helper is added to shift
      payload from the second skb in the write queue to the first
      skb, so that we always send optimally sized skbs.
      
      This increases TSO efficiency and decreases the number of ACK
      packets.
      ====================
      
      Link: https://lore.kernel.org/r/20240418214600.1291486-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
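      A toy model of the payload-shifting idea behind tcp_grow_skb(); the
      names are invented and, unlike the real helper, only byte counts
      move here rather than page fragments:
      
          struct toy_skb {
                  unsigned int len;        /* queued payload bytes */
          };
          
          /* Top up the first write-queue buffer from the second until it
           * reaches the optimal send size. */
          static void toy_grow_skb(struct toy_skb *first,
                                   struct toy_skb *second,
                                   unsigned int optimal)
          {
                  unsigned int need, move;
          
                  need = optimal > first->len ? optimal - first->len : 0;
                  move = need < second->len ? need : second->len;
          
                  first->len  += move;
                  second->len -= move;
          }
      
      With the 180224-byte optimum from the next commit, a 172032-byte
      leftover would absorb 8192 bytes from the following skb and go out
      full-sized.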
    • tcp: try to send bigger TSO packets · 8ee602c6
      Eric Dumazet authored
      While investigating TCP performance, I found that TCP would
      sometimes send big skbs followed by a single-MSS skb,
      in a 'locked' pattern.
      
      For instance, BIG TCP is enabled and MSS is set to carry 4096 bytes
      of payload per segment, with gso_max_size set to 181000.
      
      This means that an optimal TCP packet should contain
      44 * 4096 = 180224 bytes of payload.
      
      However, I was seeing packet sizes interleaved in this pattern:
      
      172032, 8192, 172032, 8192, 172032, 8192, <repeat>
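      Note that each pair in the pattern adds up to exactly the optimal
      payload: 172032 = 42 * 4096 and 8192 = 2 * 4096, summing to 180224.
      A standalone check of these figures (a sketch, not kernel code):
      
          #include <stdio.h>
          
          int main(void)
          {
                  unsigned int mss = 4096;          /* payload per segment */
                  unsigned int gso_max_size = 181000;
                  unsigned int segs = gso_max_size / mss;      /* 44 */
          
                  printf("optimal payload: %u\n", segs * mss); /* 180224 */
                  printf("observed pair:   %u + %u = %u\n",
                         42 * mss, 2 * mss, 42 * mss + 2 * mss);
                  return 0;
          }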
      
      The tcp_tso_should_defer() heuristic is defeated because, after a
      split of a packet in the write queue for whatever reason (this might
      be a too-small CWND or a small enough pacing_rate), the leftover
      packet in the queue is smaller than the optimal size.
      
      It is time to make 'leftover packets' bigger so that
      tcp_tso_should_defer() can reach its full potential.
      
      After this patch, we can see the following output:
      
      14:13:34.009273 IP6 sender > receiver: Flags [P.], seq 4048380:4098360, ack 1, win 256, options [nop,nop,TS val 3425678144 ecr 1561784500], length 49980
      14:13:34.010272 IP6 sender > receiver: Flags [P.], seq 4098360:4148340, ack 1, win 256, options [nop,nop,TS val 3425678145 ecr 1561784501], length 49980
      14:13:34.011271 IP6 sender > receiver: Flags [P.], seq 4148340:4198320, ack 1, win 256, options [nop,nop,TS val 3425678146 ecr 1561784502], length 49980
      14:13:34.012271 IP6 sender > receiver: Flags [P.], seq 4198320:4248300, ack 1, win 256, options [nop,nop,TS val 3425678147 ecr 1561784503], length 49980
      14:13:34.013272 IP6 sender > receiver: Flags [P.], seq 4248300:4298280, ack 1, win 256, options [nop,nop,TS val 3425678148 ecr 1561784504], length 49980
      14:13:34.014271 IP6 sender > receiver: Flags [P.], seq 4298280:4348260, ack 1, win 256, options [nop,nop,TS val 3425678149 ecr 1561784505], length 49980
      14:13:34.015272 IP6 sender > receiver: Flags [P.], seq 4348260:4398240, ack 1, win 256, options [nop,nop,TS val 3425678150 ecr 1561784506], length 49980
      14:13:34.016270 IP6 sender > receiver: Flags [P.], seq 4398240:4448220, ack 1, win 256, options [nop,nop,TS val 3425678151 ecr 1561784507], length 49980
      14:13:34.017269 IP6 sender > receiver: Flags [P.], seq 4448220:4498200, ack 1, win 256, options [nop,nop,TS val 3425678152 ecr 1561784508], length 49980
      14:13:34.018276 IP6 sender > receiver: Flags [P.], seq 4498200:4548180, ack 1, win 256, options [nop,nop,TS val 3425678153 ecr 1561784509], length 49980
      14:13:34.019259 IP6 sender > receiver: Flags [P.], seq 4548180:4598160, ack 1, win 256, options [nop,nop,TS val 3425678154 ecr 1561784510], length 49980
      
      With 200 concurrent flows on a 100Gbit NIC, we can see a reduction
      in the number of TSO packets (and ACK packets) of about 30%.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240418214600.1291486-4-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit() · d5b38a71
      Eric Dumazet authored
      tcp_write_xmit() calls tcp_init_tso_segs()
      to set gso_size and gso_segs on the packet.
      
      tcp_init_tso_segs() requires the stack to maintain
      an up-to-date tcp_skb_pcount(), and this makes sense
      for packets in the rtx queue, but not so much for
      packets still in the write queue.
      
      In the following patch, we don't want to deal with
      tcp_skb_pcount() when moving payload from the 2nd
      skb to the 1st skb in the write queue.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240418214600.1291486-3-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
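      For reference, the segment count that tcp_set_skb_tso_segs()
      establishes is essentially the payload length divided by the MSS,
      rounded up; a minimal model of that derivation (not the kernel
      function itself):
      
          /* DIV_ROUND_UP(len, mss): how many MSS-sized segments the
           * skb's payload spans, i.e. its gso_segs / pcount. */
          static unsigned int toy_tso_segs(unsigned int len, unsigned int mss)
          {
                  return (len + mss - 1) / mss;
          }
      
      Deriving this at transmit time means the write queue no longer has
      to keep tcp_skb_pcount() current while payload is shifted between
      skbs.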
    • tcp: remove dubious FIN exception from tcp_cwnd_test() · 22555032
      Eric Dumazet authored
      tcp_cwnd_test() has special handling for the last packet in
      the write queue if it is smaller than one MSS and has the FIN flag
      set.
      
      This is in violation of the TCP RFC and seems quite dubious.
      
      This packet can be sent only if the current CWND is bigger
      than the number of packets in flight.
      
      Making the tcp_cwnd_test() result independent of the first skb
      in the write queue is needed for the last patch of the series.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240418214600.1291486-2-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
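      A model of the rule once the exception is removed; the real
      tcp_cwnd_test() returns a budget of segments rather than a boolean,
      so this captures only the go/no-go core, with invented names:
      
          /* Without the FIN exception, any packet may be sent only while
           * the number of packets in flight stays below CWND. */
          static int toy_cwnd_allows_send(unsigned int packets_in_flight,
                                          unsigned int cwnd)
          {
                  return packets_in_flight < cwnd;
          }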
    • Merge branch 'mlx5e-per-queue-coalescing' · f62a5e71
      Jakub Kicinski authored
      Tariq Toukan says:
      
      ====================
      mlx5e per-queue coalescing
      
      This patchset adds ethtool per-queue coalescing support for the mlx5e
      driver.
      
      The series introduces some changes needed as preparation for the
      final patch, which adds the support and implements the callbacks.
      Main changes:
      - DIM code movements into its own header file.
      - Switch to dynamic allocation of the DIM struct in the RQs/SQs.
      - Allow coalescing config changes without a channel reset when
        possible.
      ====================
      
      Link: https://lore.kernel.org/r/20240419080445.417574-1-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
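      Per-queue coalescing support plugs into the two existing ethtool_ops
      callbacks for this purpose, .get_per_queue_coalesce and
      .set_per_queue_coalesce; a sketch of the wiring, with the mlx5e
      handler names assumed rather than taken from the patch:
      
          /* Sketch only; building it requires kernel ethtool headers and
           * the driver's own handler implementations. */
          static const struct ethtool_ops mlx5e_ethtool_ops = {
                  /* ...existing callbacks... */
                  .get_per_queue_coalesce = mlx5e_get_per_queue_coalesce,
                  .set_per_queue_coalesce = mlx5e_set_per_queue_coalesce,
          };
      
      Once wired up, a single queue can be tuned from user space with,
      for example, ethtool --per-queue eth0 queue_mask 0x1 --coalesce
      rx-usecs 8.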