Commits · e93af72286054e76d71aa5b7d721f33667ce0385 · Kirill Smelkov / linux

29 Mar, 2024 40 commits

nfc: st95hf: drop driver owner assignment · e93af722

Krzysztof Kozlowski authored Mar 27, 2024

Core in spi_register_driver() already sets the .owner, so driver
does not need to.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20240327174810.519676-4-krzysztof.kozlowski@linaro.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e93af722

nfc: mrvl: spi: drop driver owner assignment · e3c95d56

Krzysztof Kozlowski authored Mar 27, 2024

Core in spi_register_driver() already sets the .owner, so driver
does not need to.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20240327174810.519676-3-krzysztof.kozlowski@linaro.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e3c95d56

net: wwan: mhi: drop driver owner assignment · 34394120

Krzysztof Kozlowski authored Mar 27, 2024

Core in mhi_driver_register() already sets the .owner, so driver
does not need to.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Link: https://lore.kernel.org/r/20240327174810.519676-2-krzysztof.kozlowski@linaro.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

34394120

net: microchip: encx24j600: drop driver owner assignment · 648bb2bf

Krzysztof Kozlowski authored Mar 27, 2024

Core in spi_register_driver() already sets the .owner, so driver
does not need to.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20240327174810.519676-1-krzysztof.kozlowski@linaro.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

648bb2bf

tools/net/ynl: Add extack policy attribute decoding · bd3ce405

Donald Hunter authored Mar 28, 2024

The NLMSGERR_ATTR_POLICY extack attribute has been ignored by ynl up to
now. Extend extack decoding to include _POLICY and the nested
NL_POLICY_TYPE_ATTR_* attributes.

For example:

./tools/net/ynl/cli.py \
  --spec Documentation/netlink/specs/rt_link.yaml \
  --create --do newlink --json '{
    "ifname": "12345678901234567890",
    "linkinfo": {"kind": "bridge"}
    }'
Netlink error: Numerical result out of range
nl_len = 104 (88) nl_flags = 0x300 nl_type = 2
	error: -34	extack: {'msg': 'Attribute failed policy validation',
'policy': {'max-length': 15, 'type': 'string'}, 'bad-attr': '.ifname'}
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240328155636.64688-1-donald.hunter@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

bd3ce405

devlink: use kvzalloc() to allocate devlink instance resources · 730fffce

Jian Wen authored Mar 27, 2024

During live migration of a virtual machine, the SR-IOV VF need to be
re-registered. It may fail when the memory is badly fragmented.

The related log is as follows.

    kernel: hv_netvsc 6045bdaa-c0d1-6045-bdaa-c0d16045bdaa eth0: VF slot 1 added
...
    kernel: kworker/0:0: page allocation failure: order:7, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
    kernel: CPU: 0 PID: 24006 Comm: kworker/0:0 Tainted: G            E     5.4...x86_64 #1
    kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008  12/07/2018
    kernel: Workqueue: events work_for_cpu_fn
    kernel: Call Trace:
    kernel: dump_stack+0x8b/0xc8
    kernel: warn_alloc+0xff/0x170
    kernel: __alloc_pages_slowpath+0x92c/0xb2b
    kernel: ? get_page_from_freelist+0x1d4/0x1140
    kernel: __alloc_pages_nodemask+0x2f9/0x320
    kernel: alloc_pages_current+0x6a/0xb0
    kernel: kmalloc_order+0x1e/0x70
    kernel: kmalloc_order_trace+0x26/0xb0
    kernel: ? __switch_to_asm+0x34/0x70
    kernel: __kmalloc+0x276/0x280
    kernel: ? _raw_spin_unlock_irqrestore+0x1e/0x40
    kernel: devlink_alloc+0x29/0x110
    kernel: mlx5_devlink_alloc+0x1a/0x20 [mlx5_core]
    kernel: init_one+0x1d/0x650 [mlx5_core]
    kernel: local_pci_probe+0x46/0x90
    kernel: work_for_cpu_fn+0x1a/0x30
    kernel: process_one_work+0x16d/0x390
    kernel: worker_thread+0x1d3/0x3f0
    kernel: kthread+0x105/0x140
    kernel: ? max_active_store+0x80/0x80
    kernel: ? kthread_bind+0x20/0x20
    kernel: ret_from_fork+0x3a/0x50
Signed-off-by: Jian Wen <wenjian1@xiaomi.com>
Link: https://lore.kernel.org/r/20240327082128.942818-1-wenjian1@xiaomi.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

730fffce

Merge branch 'enabled-wformat-truncation-for-clang' · 45e0ba9c

Jakub Kicinski authored Mar 29, 2024

Arnd Bergmann says:

====================
enabled -Wformat-truncation for clang

With randconfig build testing, I found only eight files that produce
warnings with clang when -Wformat-truncation is enabled. This means
we can just turn it on by default rather than only enabling it for
"make W=1".

Unfortunately, gcc produces a lot more warnings when the option
is enabled, so it's not yet possible to turn it on both compilers.
====================

Link: https://lore.kernel.org/r/20240326223825.4084412-1-arnd@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

45e0ba9c

mlx5: avoid truncating error message · b324a960

Arnd Bergmann authored Mar 26, 2024

clang warns that one error message is too long for its destination buffer:

drivers/net/ethernet/mellanox/mlx5/core/esw/bridge.c:1876:4: error: 'snprintf' will always be truncated; specified size is 80, but format string expands to at least 94 [-Werror,-Wformat-truncation-non-kprintf]

Reword it to be a bit shorter so it always fits.

Fixes: 70f0302b ("net/mlx5: Bridge, implement mdb offload")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://lore.kernel.org/r/20240326223825.4084412-5-arnd@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b324a960

qed: avoid truncating work queue length · 954fd908

Arnd Bergmann authored Mar 26, 2024

clang complains that the temporary string for the name passed into
alloc_workqueue() is too short for its contents:

drivers/net/ethernet/qlogic/qed/qed_main.c:1218:3: error: 'snprintf' will always be truncated; specified size is 16, but format string expands to at least 18 [-Werror,-Wformat-truncation]

There is no need for a temporary buffer, and the actual name of a workqueue
is 32 bytes (WQ_NAME_LEN), so just use the interface as intended to avoid
the truncation.

Fixes: 59ccf86f ("qed: Add driver infrastucture for handling mfw requests.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240326223825.4084412-4-arnd@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

954fd908

enetc: avoid truncating error message · 9046d581

Arnd Bergmann authored Mar 26, 2024

As clang points out, the error message in enetc_setup_xdp_prog()
still does not fit in the buffer and will be truncated:

drivers/net/ethernet/freescale/enetc/enetc.c:2771:3: error: 'snprintf' will always be truncated; specified size is 80, but format string expands to at least 87 [-Werror,-Wformat-truncation]

Replace it with an even shorter message that should fit.

Fixes: f968c564 ("net: enetc: shorten enetc_setup_xdp_prog() error message to fit NETLINK_MAX_FMTMSG_LEN")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240326223825.4084412-3-arnd@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9046d581

octeontx2-af: Increase maximum BPID channels · b6694abc

Radha Mohan Chintakuntla authored Mar 26, 2024

Any NIX interface type can have maximum 256 channels. So increased the
backpressure ID count to 256 so that it can cover cn9k and cn10k SoCs that
have different NIX interface types with varied maximum channels.
Signed-off-by: Radha Mohan Chintakuntla <radhac@marvell.com>
Link: https://lore.kernel.org/r/20240326184514.1628284-1-radhac@marvell.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b6694abc

Merge branch 'add-ip-port-information-to-udp-drop-tracepoint' · e6091ba5

Jakub Kicinski authored Mar 29, 2024

Balazs Scheidler says:

====================
Add IP/port information to UDP drop tracepoint

In our use-case we would like to recover the properties of dropped UDP
packets. Unfortunately the current udp_fail_queue_rcv_skb tracepoint
only exposes the port number of the receiving socket.

This patch-set will add the source/dest ip/port to the tracepoint, while
keeping the socket's local port as well for compatibility.

Thanks for the review comments by Jason and Kuniyuki, they helped me a lot
and I tried to address all of their comments in this new iteration.
====================

Link: https://lore.kernel.org/r/cover.1711475011.git.balazs.scheidler@axoflow.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e6091ba5

net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb · e9669a00

Balazs Scheidler authored Mar 26, 2024

The udp_fail_queue_rcv_skb() tracepoint lacks any details on the source
and destination IP/port whereas this information can be critical in case
of UDP/syslog.
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/0c8b3e33dbf679e190be6f4c6736603a76988a20.1711475011.git.balazs.scheidler@axoflow.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e9669a00

net: port TP_STORE_ADDR_PORTS_SKB macro to be tcp/udp independent · a0ad11fc

Balazs Scheidler authored Mar 26, 2024

This patch moves TP_STORE_ADDR_PORTS_SKB() to a common header and removes
the TCP specific implementation details.

Previously the macro assumed the skb passed as an argument is a
TCP packet, the implementation now uses an argument to the L4 header and
uses that to extract the source/destination ports, which happen
to be named the same in "struct tcphdr" and "struct udphdr"
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Balazs Scheidler <balazs.scheidler@axoflow.com>
Link: https://lore.kernel.org/r/9e306f78260dfbbdc7353ba5f864cc027a409540.1711475011.git.balazs.scheidler@axoflow.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a0ad11fc

nfc: st95hf: Switch to using gpiod API · b7b1c590

Andy Shevchenko authored Mar 26, 2024

This updates the driver to gpiod API, and removes yet another use of
of_get_named_gpio().
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20240326175836.1418718-1-andriy.shevchenko@linux.intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b7b1c590

Merge branch 'add-en8811h-phy-driver-and-devicetree-binding-doc' · b125f3eb

Jakub Kicinski authored Mar 29, 2024

Eric Woudstra says:

====================
Add en8811h phy driver and devicetree binding doc

This patch series adds the driver and the devicetree binding documentation
for the Airoha en8811h PHY.
====================

Link: https://lore.kernel.org/r/20240326162305.303598-1-ericwouds@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b125f3eb

net: phy: air_en8811h: Add the Airoha EN8811H PHY driver · 71e79430

Eric Woudstra authored Mar 26, 2024

Add the driver for the Airoha EN8811H 2.5 Gigabit PHY. The phy supports
100/1000/2500 Mbps with auto negotiation only.

The driver uses two firmware files, for which updated versions are added to
linux-firmware already.

Note: At phy-address + 8 there is another device on the mdio bus, that
belongs to the EN881H. While the original driver writes to it, Airoha
has confirmed this is not needed. Therefore, communication with this
device is not included in this driver.
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326162305.303598-3-ericwouds@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

71e79430

dt-bindings: net: airoha,en8811h: Add en8811h · 2434ba2b

Eric Woudstra authored Mar 26, 2024

Add the Airoha EN8811H 2.5 Gigabit PHY.

The en8811h phy can be set with serdes polarity reversed on rx and/or tx.
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326162305.303598-2-ericwouds@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2434ba2b

Merge branch 'af_unix-rework-gc' · da493dbb

Jakub Kicinski authored Mar 29, 2024

Kuniyuki Iwashima says:

====================
af_unix: Rework GC.

When we pass a file descriptor to an AF_UNIX socket via SCM_RIGTHS,
the underlying struct file of the inflight fd gets its refcount bumped.
If the fd is of an AF_UNIX socket, we need to track it in case it forms
cyclic references.

Let's say we send a fd of AF_UNIX socket A to B and vice versa and
close() both sockets.

When created, each socket's struct file initially has one reference.
After the fd exchange, both refcounts are bumped up to 2.  Then, close()
decreases both to 1.  From this point on, no one can touch the file/socket.

However, the struct file has one refcount and thus never calls the
release() function of the AF_UNIX socket.

That's why we need to track all inflight AF_UNIX sockets and run garbage
collection.

This series replaces the current GC implementation that locks each inflight
socket's receive queue and requires trickiness in other places.

The new GC does not lock each socket's queue to minimise its effect and
tries to be lightweight if there is no cyclic reference or no update in
the shape of the inflight fd graph.

The new implementation is based on Tarjan's Strongly Connected Components
algorithm, and we will consider each inflight AF_UNIX socket as a vertex
and its file descriptor as an edge in a directed graph.

For the details, please see each patch.

  patch 1  -  3 : Add struct to express inflight socket graphs
  patch       4 : Optimse inflight fd counting
  patch 5  -  6 : Group SCC possibly forming a cycle
  patch 7  -  8 : Support embryo socket
  patch 9  - 11 : Make GC lightweight
  patch 12 - 13 : Detect dead cycle references
  patch      14 : Replace GC algorithm
  patch      15 : selftest

After this series is applied, we can remove the two ugly tricks for race,
scm_fp_dup() in unix_attach_fds() and spin_lock dance in unix_peek_fds()
as done in patch 14/15 of v1.

Also, we will add cond_resched_lock() in __unix_gc() and convert it to
use a dedicated kthread instead of global system workqueue as suggested
by Paolo in a v4 thread.

v4: https://lore.kernel.org/netdev/20240301022243.73908-1-kuniyu@amazon.com/
v3: https://lore.kernel.org/netdev/20240223214003.17369-1-kuniyu@amazon.com/
v2: https://lore.kernel.org/netdev/20240216210556.65913-1-kuniyu@amazon.com/
v1: https://lore.kernel.org/netdev/20240203030058.60750-1-kuniyu@amazon.com/
====================

Link: https://lore.kernel.org/r/20240325202425.60930-1-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

da493dbb

selftest: af_unix: Test GC for SCM_RIGHTS. · 2aa0cff2

Kuniyuki Iwashima authored Mar 25, 2024

This patch adds test cases to verify the new GC.

We run each test for the following cases:

  * SOCK_DGRAM
  * SOCK_STREAM without embryo socket
  * SOCK_STREAM without embryo socket + MSG_OOB
  * SOCK_STREAM with embryo sockets
  * SOCK_STREAM with embryo sockets + MSG_OOB

Before and after running each test case, we ensure that there is
no AF_UNIX socket left in the netns by reading /proc/net/protocols.

We cannot use /proc/net/unix and UNIX_DIAG because the embryo socket
does not show up there.

Each test creates multiple sockets in an array.  We pass sockets in
the even index using the peer sockets in the odd index.

So, send_fd(0, 1) actually sends fd[0] to fd[2] via fd[0 + 1].

  Test 1 : A <-> A
  Test 2 : A <-> B
  Test 3 : A -> B -> C <- D
           ^.___|___.'    ^
                `---------'
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-16-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2aa0cff2

af_unix: Replace garbage collection algorithm. · 4090fa37

Kuniyuki Iwashima authored Mar 25, 2024

If we find a dead SCC during iteration, we call unix_collect_skb()
to splice all skb in the SCC to the global sk_buff_head, hitlist.

After iterating all SCC, we unlock unix_gc_lock and purge the queue.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-15-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4090fa37

af_unix: Detect dead SCC. · a15702d8

Kuniyuki Iwashima authored Mar 25, 2024

When iterating SCC, we call unix_vertex_dead() for each vertex
to check if the vertex is close()d and has no bridge to another
SCC.

If both conditions are true for every vertex in SCC, we can
execute garbage collection for all skb in the SCC.

The actual garbage collection is done in the following patch,
replacing the old implementation.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-14-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a15702d8

af_unix: Assign a unique index to SCC. · bfdb0128

Kuniyuki Iwashima authored Mar 25, 2024

The definition of the lowlink in Tarjan's algorithm is the
smallest index of a vertex that is reachable with at most one
back-edge in SCC.  This is not useful for a cross-edge.

If we start traversing from A in the following graph, the final
lowlink of D is 3.  The cross-edge here is one between D and C.

  A -> B -> D   D = (4, 3)  (index, lowlink)
  ^    |    |   C = (3, 1)
  |    V    |   B = (2, 1)
  `--- C <--'   A = (1, 1)

This is because the lowlink of D is updated with the index of C.

In the following patch, we detect a dead SCC by checking two
conditions for each vertex.

  1) vertex has no edge directed to another SCC (no bridge)
  2) vertex's out_degree is the same as the refcount of its file

If 1) is false, there is a receiver of all fds of the SCC and
its ancestor SCC.

To evaluate 1), we need to assign a unique index to each SCC and
assign it to all vertices in the SCC.

This patch changes the lowlink update logic for cross-edge so
that in the example above, the lowlink of D is updated with the
lowlink of C.

  A -> B -> D   D = (4, 1)  (index, lowlink)
  ^    |    |   C = (3, 1)
  |    V    |   B = (2, 1)
  `--- C <--'   A = (1, 1)

Then, all vertices in the same SCC have the same lowlink, and we
can quickly find the bridge connecting to different SCC if exists.

However, it is no longer called lowlink, so we rename it to
scc_index.  (It's sometimes called lowpoint.)

Also, we add a global variable to hold the last index used in DFS
so that we do not reset the initial index in each DFS.

This patch can be squashed to the SCC detection patch but is
split deliberately for anyone wondering why lowlink is not used
as used in the original Tarjan's algorithm and many reference
implementations.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-13-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

bfdb0128

af_unix: Avoid Tarjan's algorithm if unnecessary. · ad081928

Kuniyuki Iwashima authored Mar 25, 2024

Once a cyclic reference is formed, we need to run GC to check if
there is dead SCC.

However, we do not need to run Tarjan's algorithm if we know that
the shape of the inflight graph has not been changed.

If an edge is added/updated/deleted and the edge's successor is
inflight, we set false to unix_graph_grouped, which means we need
to re-classify SCC.

Once we finalise SCC, we set true to unix_graph_grouped.

While unix_graph_grouped is true, we can iterate the grouped
SCC using vertex->scc_entry in unix_walk_scc_fast().

list_add() and list_for_each_entry_reverse() uses seem weird, but
they are to keep the vertex order consistent and make writing test
easier.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-12-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ad081928

af_unix: Skip GC if no cycle exists. · 77e5593a

Kuniyuki Iwashima authored Mar 25, 2024

We do not need to run GC if there is no possible cyclic reference.
We use unix_graph_maybe_cyclic to decide if we should run GC.

If a fd of an AF_UNIX socket is passed to an already inflight AF_UNIX
socket, they could form a cyclic reference.  Then, we set true to
unix_graph_maybe_cyclic and later run Tarjan's algorithm to group
them into SCC.

Once we run Tarjan's algorithm, we are 100% sure whether cyclic
references exist or not.  If there is no cycle, we set false to
unix_graph_maybe_cyclic and can skip the entire garbage collection
next time.

When finalising SCC, we set true to unix_graph_maybe_cyclic if SCC
consists of multiple vertices.

Even if SCC is a single vertex, a cycle might exist as self-fd passing.
Given the corner case is rare, we detect it by checking all edges of
the vertex and set true to unix_graph_maybe_cyclic.

With this change, __unix_gc() is just a spin_lock() dance in the normal
usage.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-11-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

77e5593a

af_unix: Save O(n) setup of Tarjan's algo. · ba31b4a4

Kuniyuki Iwashima authored Mar 25, 2024

Before starting Tarjan's algorithm, we need to mark all vertices
as unvisited.  We can save this O(n) setup by reserving two special
indices (0, 1) and using two variables.

The first time we link a vertex to unix_unvisited_vertices, we set
unix_vertex_unvisited_index to index.

During DFS, we can see that the index of unvisited vertices is the
same as unix_vertex_unvisited_index.

When we finalise SCC later, we set unix_vertex_grouped_index to each
vertex's index.

Then, we can know (i) that the vertex is on the stack if the index
of a visited vertex is >= 2 and (ii) that it is not on the stack and
belongs to a different SCC if the index is unix_vertex_grouped_index.

After the whole algorithm, all indices of vertices are set as
unix_vertex_grouped_index.

Next time we start DFS, we know that all unvisited vertices have
unix_vertex_grouped_index, and we can use unix_vertex_unvisited_index
as the not-on-stack marker.

To use the same variable in __unix_walk_scc(), we can swap
unix_vertex_(grouped|unvisited)_index at the end of Tarjan's
algorithm.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-10-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ba31b4a4

af_unix: Fix up unix_edge.successor for embryo socket. · dcf70df2

Kuniyuki Iwashima authored Mar 25, 2024

To garbage collect inflight AF_UNIX sockets, we must define the
cyclic reference appropriately.  This is a bit tricky if the loop
consists of embryo sockets.

Suppose that the fd of AF_UNIX socket A is passed to D and the fd B
to C and that C and D are embryo sockets of A and B, respectively.
It may appear that there are two separate graphs, A (-> D) and
B (-> C), but this is not correct.

     A --. .-- B
          X
     C <-' `-> D

Now, D holds A's refcount, and C has B's refcount, so unix_release()
will never be called for A and B when we close() them.  However, no
one can call close() for D and C to free skbs holding refcounts of A
and B because C/D is in A/B's receive queue, which should have been
purged by unix_release() for A and B.

So, here's another type of cyclic reference.  When a fd of an AF_UNIX
socket is passed to an embryo socket, the reference is indirectly held
by its parent listening socket.

  .-> A                            .-> B
  |   `- sk_receive_queue          |   `- sk_receive_queue
  |      `- skb                    |      `- skb
  |         `- sk == C             |         `- sk == D
  |            `- sk_receive_queue |           `- sk_receive_queue
  |               `- skb +---------'               `- skb +-.
  |                                                         |
  `---------------------------------------------------------'

Technically, the graph must be denoted as A <-> B instead of A (-> D)
and B (-> C) to find such a cyclic reference without touching each
socket's receive queue.

  .-> A --. .-- B <-.
  |        X        |  ==  A <-> B
  `-- C <-' `-> D --'

We apply this fixup during GC by fetching the real successor by
unix_edge_successor().

When we call accept(), we clear unix_sock.listener under unix_gc_lock
not to confuse GC.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-9-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

dcf70df2

af_unix: Save listener for embryo socket. · aed6ecef

Kuniyuki Iwashima authored Mar 25, 2024

This is a prep patch for the following change, where we need to
fetch the listening socket from the successor embryo socket
during GC.

We add a new field to struct unix_sock to save a pointer to a
listening socket.

We set it when connect() creates a new socket, and clear it when
accept() is called.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-8-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

aed6ecef

af_unix: Detect Strongly Connected Components. · 3484f063

Kuniyuki Iwashima authored Mar 25, 2024

In the new GC, we use a simple graph algorithm, Tarjan's Strongly
Connected Components (SCC) algorithm, to find cyclic references.

The algorithm visits every vertex exactly once using depth-first
search (DFS).

DFS starts by pushing an input vertex to a stack and assigning it
a unique number.  Two fields, index and lowlink, are initialised
with the number, but lowlink could be updated later during DFS.

If a vertex has an edge to an unvisited inflight vertex, we visit
it and do the same processing.  So, we will have vertices in the
stack in the order they appear and number them consecutively in
the same order.

If a vertex has a back-edge to a visited vertex in the stack,
we update the predecessor's lowlink with the successor's index.

After iterating edges from the vertex, we check if its index
equals its lowlink.

If the lowlink is different from the index, it shows there was a
back-edge.  Then, we go backtracking and propagate the lowlink to
its predecessor and resume the previous edge iteration from the
next edge.

If the lowlink is the same as the index, we pop vertices before
and including the vertex from the stack.  Then, the set of vertices
is SCC, possibly forming a cycle.  At the same time, we move the
vertices to unix_visited_vertices.

When we finish the algorithm, all vertices in each SCC will be
linked via unix_vertex.scc_entry.

Let's take an example.  We have a graph including five inflight
vertices (F is not inflight):

  A -> B -> C -> D -> E (-> F)
       ^         |
       `---------'

Suppose that we start DFS from C.  We will visit C, D, and B first
and initialise their index and lowlink.  Then, the stack looks like
this:

  > B = (3, 3)  (index, lowlink)
    D = (2, 2)
    C = (1, 1)

When checking B's edge to C, we update B's lowlink with C's index
and propagate it to D.

    B = (3, 1)  (index, lowlink)
  > D = (2, 1)
    C = (1, 1)

Next, we visit E, which has no edge to an inflight vertex.

  > E = (4, 4)  (index, lowlink)
    B = (3, 1)
    D = (2, 1)
    C = (1, 1)

When we leave from E, its index and lowlink are the same, so we
pop E from the stack as single-vertex SCC.  Next, we leave from
B and D but do nothing because their lowlink are different from
their index.

    B = (3, 1)  (index, lowlink)
    D = (2, 1)
  > C = (1, 1)

Then, we leave from C, whose index and lowlink are the same, so
we pop B, D and C as SCC.

Last, we do DFS for the rest of vertices, A, which is also a
single-vertex SCC.

Finally, each unix_vertex.scc_entry is linked as follows:

  A -.  B -> C -> D  E -.
  ^  |  ^         |  ^  |
  `--'  `---------'  `--'

We use SCC later to decide whether we can garbage-collect the
sockets.

Note that we still cannot detect SCC properly if an edge points
to an embryo socket.  The following two patches will sort it out.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-7-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3484f063

af_unix: Iterate all vertices by DFS. · 6ba76fd2

Kuniyuki Iwashima authored Mar 25, 2024

The new GC will use a depth first search graph algorithm to find
cyclic references.  The algorithm visits every vertex exactly once.

Here, we implement the DFS part without recursion so that no one
can abuse it.

unix_walk_scc() marks every vertex unvisited by initialising index
as UNIX_VERTEX_INDEX_UNVISITED and iterates inflight vertices in
unix_unvisited_vertices and call __unix_walk_scc() to start DFS from
an arbitrary vertex.

__unix_walk_scc() iterates all edges starting from the vertex and
explores the neighbour vertices with DFS using edge_stack.

After visiting all neighbours, __unix_walk_scc() moves the visited
vertex to unix_visited_vertices so that unix_walk_scc() will not
restart DFS from the visited vertex.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-6-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6ba76fd2

af_unix: Bulk update unix_tot_inflight/unix_inflight when queuing skb. · 22c3c0c5

Kuniyuki Iwashima authored Mar 25, 2024

Currently, we track the number of inflight sockets in two variables.
unix_tot_inflight is the total number of inflight AF_UNIX sockets on
the host, and user->unix_inflight is the number of inflight fds per
user.

We update them one by one in unix_inflight(), which can be done once
in batch. Also, sendmsg() could fail even after unix_inflight(), then
we need to acquire unix_gc_lock only to decrement the counters.

Let's bulk update the counters in unix_add_edges() and unix_del_edges(),
which is called only for successfully passed fds.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-5-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

22c3c0c5

af_unix: Link struct unix_edge when queuing skb. · 42f298c0

Kuniyuki Iwashima authored Mar 25, 2024

Just before queuing skb with inflight fds, we call scm_stat_add(),
which is a good place to set up the preallocated struct unix_vertex
and struct unix_edge in UNIXCB(skb).fp.

Then, we call unix_add_edges() and construct the directed graph
as follows:

  1. Set the inflight socket's unix_sock to unix_edge.predecessor.
  2. Set the receiver's unix_sock to unix_edge.successor.
  3. Set the preallocated vertex to inflight socket's unix_sock.vertex.
  4. Link inflight socket's unix_vertex.entry to unix_unvisited_vertices.
  5. Link unix_edge.vertex_entry to the inflight socket's unix_vertex.edges.

Let's say we pass the fd of AF_UNIX socket A to B and the fd of B
to C.  The graph looks like this:

  +-------------------------+
  | unix_unvisited_vertices | <-------------------------.
  +-------------------------+                           |
  +                                                     |
  |     +--------------+             +--------------+   |         +--------------+
  |     |  unix_sock A | <---. .---> |  unix_sock B | <-|-. .---> |  unix_sock C |
  |     +--------------+     | |     +--------------+   | | |     +--------------+
  | .-+ |    vertex    |     | | .-+ |    vertex    |   | | |     |    vertex    |
  | |   +--------------+     | | |   +--------------+   | | |     +--------------+
  | |                        | | |                      | | |
  | |   +--------------+     | | |   +--------------+   | | |
  | '-> |  unix_vertex |     | | '-> |  unix_vertex |   | | |
  |     +--------------+     | |     +--------------+   | | |
  `---> |    entry     | +---------> |    entry     | +-' | |
        |--------------|     | |     |--------------|     | |
        |    edges     | <-. | |     |    edges     | <-. | |
        +--------------+   | | |     +--------------+   | | |
                           | | |                        | | |
    .----------------------' | | .----------------------' | |
    |                        | | |                        | |
    |   +--------------+     | | |   +--------------+     | |
    |   |   unix_edge  |     | | |   |   unix_edge  |     | |
    |   +--------------+     | | |   +--------------+     | |
    `-> | vertex_entry |     | | `-> | vertex_entry |     | |
        |--------------|     | |     |--------------|     | |
        |  predecessor | +---' |     |  predecessor | +---' |
        |--------------|       |     |--------------|       |
        |   successor  | +-----'     |   successor  | +-----'
        +--------------+             +--------------+

Henceforth, we denote such a graph as A -> B (-> C).

Now, we can express all inflight fd graphs that do not contain
embryo sockets.  We will support the particular case later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-4-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

42f298c0

af_unix: Allocate struct unix_edge for each inflight AF_UNIX fd. · 29b64e35

Kuniyuki Iwashima authored Mar 25, 2024

As with the previous patch, we preallocate to skb's scm_fp_list an
array of struct unix_edge in the number of inflight AF_UNIX fds.

There we just preallocate memory and do not use immediately because
sendmsg() could fail after this point.  The actual use will be in
the next patch.

When we queue skb with inflight edges, we will set the inflight
socket's unix_sock as unix_edge->predecessor and the receiver's
unix_sock as successor, and then we will link the edge to the
inflight socket's unix_vertex.edges.

Note that we set NULL to cloned scm_fp_list.edges in scm_fp_dup()
so that MSG_PEEK does not change the shape of the directed graph.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-3-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

29b64e35

af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd. · 1fbfdfaa

Kuniyuki Iwashima authored Mar 25, 2024

We will replace the garbage collection algorithm for AF_UNIX, where
we will consider each inflight AF_UNIX socket as a vertex and its file
descriptor as an edge in a directed graph.

This patch introduces a new struct unix_vertex representing a vertex
in the graph and adds its pointer to struct unix_sock.

When we send a fd using the SCM_RIGHTS message, we allocate struct
scm_fp_list to struct scm_cookie in scm_fp_copy().  Then, we bump
each refcount of the inflight fds' struct file and save them in
scm_fp_list.fp.

After that, unix_attach_fds() inexplicably clones scm_fp_list of
scm_cookie and sets it to skb.  (We will remove this part after
replacing GC.)

Here, we add a new function call in unix_attach_fds() to preallocate
struct unix_vertex per inflight AF_UNIX fd and link each vertex to
skb's scm_fp_list.vertices.

When sendmsg() succeeds later, if the socket of the inflight fd is
still not inflight yet, we will set the preallocated vertex to struct
unix_sock.vertex and link it to a global list unix_unvisited_vertices
under spin_lock(&unix_gc_lock).

If the socket is already inflight, we free the preallocated vertex.
This is to avoid taking the lock unnecessarily when sendmsg() could
fail later.

In the following patch, we will similarly allocate another struct
per edge, which will finally be linked to the inflight socket's
unix_vertex.edges.

And then, we will count the number of edges as unix_vertex.out_degree.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-2-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1fbfdfaa

tcp/dccp: bypass empty buckets in inet_twsk_purge() · 50e2907e

Eric Dumazet authored Mar 27, 2024

TCP ehash table is often sparsely populated.

inet_twsk_purge() spends too much time calling cond_resched().

This patch can reduce time spent in inet_twsk_purge() by 20x.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240327191206.508114-1-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

50e2907e

net: dsa: hellcreek: Convert to gettimex64() · 385edf32

Kurt Kanzenbach authored Mar 26, 2024

As of commit 916444df ("ptp: deprecate gettime64() in favor of
gettimex64()") (new) PTP drivers should rather implement gettimex64().

In addition, this variant provides timestamps from the system clock. The
readings have to be recorded right before and after reading the lowest bits
of the PHC timestamp.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

385edf32

net/smc: make smc_hash_sk/smc_unhash_sk static · af398bd0

Zhengchao Shao authored Mar 26, 2024

smc_hash_sk and smc_unhash_sk are only used in af_smc.c, so make them
static and remove the output symbol. They can be called under the path
.prot->hash()/unhash().
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

af398bd0

Merge branch 'net-sched-skip_sw' · 17593357

David S. Miller authored Mar 29, 2024

Asbjørn Sloth Tønnesen says:

====================
make skip_sw actually skip software

During development of flower-route[1], which I
recently presented at FOSDEM[2], I noticed that
CPU usage, would increase the more rules I installed
into the hardware for IP forwarding offloading.

Since we use TC flower offload for the hottest
prefixes, and leave the long tail to the normal (non-TC)
Linux network stack for slow-path IP forwarding.
We therefore need both the hardware and software
datapath to perform well.

I found that skip_sw rules, are quite expensive
in the kernel datapath, since they must be evaluated
and matched upon, before the kernel checks the
skip_sw flag.

This patchset optimizes the case where all rules
are skip_sw, by implementing a TC bypass for these
cases, where TC is only used as a control plane
for the hardware path.

v4:
- Rebased onto net-next, now that net-next is open again

v3: https://lore.kernel.org/netdev/20240306165813.656931-1-ast@fiberby.net/
- Patch 3:
  - Fix source_inline
  - Fix build failure, when CONFIG_NET_CLS without CONFIG_NET_CLS_ACT.

v2: https://lore.kernel.org/netdev/20240305144404.569632-1-ast@fiberby.net/
- Patch 1:
  - Add Reviewed-By from Jiri Pirko
- Patch 2:
  - Move code, to avoid forward declaration (Jiri).
- Patch 3
  - Refactor to use a static key.
  - Add performance data for trapping, or sending
    a packet to a non-existent chain (as suggested by Marcelo).

v1: https://lore.kernel.org/netdev/20240215160458.1727237-1-ast@fiberby.net/

[1] flower-route
    https://github.com/fiberby-dk/flower-route

[2] FOSDEM talk
    https://fosdem.org/2024/schedule/event/fosdem-2024-3337-flying-higher-hardware-offloading-with-bird/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

17593357

net: sched: make skip_sw actually skip software · 047f340b

Asbjørn Sloth Tønnesen authored Mar 25, 2024

TC filters come in 3 variants:
- no flag (try to process in hardware, but fallback to software))
- skip_hw (do not process filter by hardware)
- skip_sw (do not process filter by software)

However skip_sw is implemented so that the skip_sw
flag can first be checked, after it has been matched.

IMHO it's common when using skip_sw, to use it on all rules.

So if all filters in a block is skip_sw filters, then
we can bail early, we can thus avoid having to match
the filters, just to check for the skip_sw flag.

This patch adds a bypass, for when only TC skip_sw rules
are used. The bypass is guarded by a static key, to avoid
harming other workloads.

There are 3 ways that a packet from a skip_sw ruleset, can
end up in the kernel path. Although the send packets to a
non-existent chain way is only improved a few percents, then
I believe it's worth optimizing the trap and fall-though
use-cases.

 +----------------------------+--------+--------+--------+
 | Test description           | Pre-   | Post-  | Rel.   |
 |                            | kpps   | kpps   | chg.   |
 +----------------------------+--------+--------+--------+
 | basic forwarding + notrack | 3589.3 | 3587.9 |  1.00x |
 | switch to eswitch mode     | 3081.8 | 3094.7 |  1.00x |
 | add ingress qdisc          | 3042.9 | 3063.6 |  1.01x |
 | tc forward in hw / skip_sw |37024.7 |37028.4 |  1.00x |
 | tc forward in sw / skip_hw | 3245.0 | 3245.3 |  1.00x |
 +----------------------------+--------+--------+--------+
 | tests with only skip_sw rules below:                  |
 +----------------------------+--------+--------+--------+
 | 1 non-matching rule        | 2694.7 | 3058.7 |  1.14x |
 | 1 n-m rule, match trap     | 2611.2 | 3323.1 |  1.27x |
 | 1 n-m rule, goto non-chain | 2886.8 | 2945.9 |  1.02x |
 | 5 non-matching rules       | 1958.2 | 3061.3 |  1.56x |
 | 5 n-m rules, match trap    | 1911.9 | 3327.0 |  1.74x |
 | 5 n-m rules, goto non-chain| 2883.1 | 2947.5 |  1.02x |
 | 10 non-matching rules      | 1466.3 | 3062.8 |  2.09x |
 | 10 n-m rules, match trap   | 1444.3 | 3317.9 |  2.30x |
 | 10 n-m rules,goto non-chain| 2883.1 | 2939.5 |  1.02x |
 | 25 non-matching rules      |  838.5 | 3058.9 |  3.65x |
 | 25 n-m rules, match trap   |  824.5 | 3323.0 |  4.03x |
 | 25 n-m rules,goto non-chain| 2875.8 | 2944.7 |  1.02x |
 | 50 non-matching rules      |  488.1 | 3054.7 |  6.26x |
 | 50 n-m rules, match trap   |  484.9 | 3318.5 |  6.84x |
 | 50 n-m rules,goto non-chain| 2884.1 | 2939.7 |  1.02x |
 +----------------------------+--------+--------+--------+

perf top (25 n-m skip_sw rules - pre patch):
  20.39%  [kernel]  [k] __skb_flow_dissect
  16.43%  [kernel]  [k] rhashtable_jhash2
  10.58%  [kernel]  [k] fl_classify
  10.23%  [kernel]  [k] fl_mask_lookup
   4.79%  [kernel]  [k] memset_orig
   2.58%  [kernel]  [k] tcf_classify
   1.47%  [kernel]  [k] __x86_indirect_thunk_rax
   1.42%  [kernel]  [k] __dev_queue_xmit
   1.36%  [kernel]  [k] nft_do_chain
   1.21%  [kernel]  [k] __rcu_read_lock

perf top (25 n-m skip_sw rules - post patch):
   5.12%  [kernel]  [k] __dev_queue_xmit
   4.77%  [kernel]  [k] nft_do_chain
   3.65%  [kernel]  [k] dev_gro_receive
   3.41%  [kernel]  [k] check_preemption_disabled
   3.14%  [kernel]  [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
   2.88%  [kernel]  [k] __netif_receive_skb_core.constprop.0
   2.49%  [kernel]  [k] mlx5e_xmit
   2.15%  [kernel]  [k] ip_forward
   1.95%  [kernel]  [k] mlx5e_tc_restore_tunnel
   1.92%  [kernel]  [k] vlan_gro_receive

Test setup:
 DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
 Data rate measured on switch (Extreme X690), and DUT connected as
 a router on a stick, with pktgen and pktsink as VLANs.
 Pktgen-dpdk was in range 36.6-37.7 Mpps 64B packets across all tests.
 Full test data at https://files.fiberby.net/ast/2024/tc_skip_sw/v2_tests/Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

047f340b

net: sched: cls_api: add filter counter · 2081fd34

Asbjørn Sloth Tønnesen authored Mar 25, 2024

Maintain a count of filters per block.

Counter updates are protected by cb_lock, which is
also used to protect the offload counters.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2081fd34