- 26 Jun, 2024 9 commits
-
-
Jakub Kicinski authored
Heng Qi says:

====================
ethtool: provide the dim profile fine-tuning channel

The NetDIM library provides excellent acceleration for many modern network cards. However, the default DIM profiles limit its maximum capabilities for different NICs, so providing a way in which the NIC can be custom-configured is necessary. Currently, this way is based on the commonly used "ethtool -C".

For example, on the server side, a virtio-net NIC with rx dim enabled has 8 queues and runs nginx. The client uses the following command to send traffic to the server:

  ./wrk http://server_ip:80 -c 64 -t 5 -d 30

Then adjust the default rx-profile for the server's dim to:

  {.usec = 1,   .pkts = 256, .comps = n/a,},
  {.usec = 8,   .pkts = 256, .comps = n/a,},
  {.usec = 30,  .pkts = 256, .comps = n/a,},
  {.usec = 64,  .pkts = 256, .comps = n/a,},
  {.usec = 128, .pkts = 256, .comps = n/a,}

The server PPS is improved by 20%+.
====================

Link: https://patch.msgid.link/20240621101353.107425-1-hengqi@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
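For illustration, a tuned RX profile like the one above maps onto the DIM library's profile entry type roughly as follows. This is only a sketch: struct dim_cq_moder and NET_DIM_PARAMS_NUM_PROFILES are taken from include/linux/dim.h, while the array name is made up and nothing here is copied from the series.

  #include <linux/dim.h>

  /* Tuned RX profile from the cover letter: constant 256 pkts with a
   * finer-grained usec ladder than the default 1/8/64/128/256 steps.
   */
  static const struct dim_cq_moder tuned_rx_profile[NET_DIM_PARAMS_NUM_PROFILES] = {
          { .usec = 1,   .pkts = 256, },
          { .usec = 8,   .pkts = 256, },
          { .usec = 30,  .pkts = 256, },
          { .usec = 64,  .pkts = 256, },
          { .usec = 128, .pkts = 256, },
  };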
-
Heng Qi authored
Virtio-net has different types of back-end device implementations. In order to effectively optimize the dim library's gains for different device implementations, let's use the new interface params to initialize and query dim results from a customized profile list.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240621101353.107425-6-hengqi@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Heng Qi authored
DIM-related mode and work have been collected in the same place, so new interfaces are added for convenience.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240621101353.107425-5-hengqi@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Heng Qi authored
The NetDIM library, currently leveraged by an array of NICs, delivers excellent acceleration benefits. Nevertheless, NICs vary significantly in their dim profile list prerequisites. Specifically, virtio-net backends may present diverse sw or hw device implementations, making a one-size-fits-all parameter list impractical.

On Alibaba Cloud, the virtio DPU's performance under the default DIM profile falls short of expectations, partly due to a mismatch in parameter configuration. I also noticed that ice/idpf/ena and other NICs have customized profile lists or placed some restrictions on dim capabilities.

Motivated by this, I tried adding new params for "ethtool -C" that provide per-device control to modify and access a device's interrupt parameters.

Usage
========
The target NIC is named ethx. Assume that ethx only declares support for rx profile setting (with the DIM_PROFILE_RX flag set in profile_flags) and supports modification of the usec and pkt fields.

1. Query the currently customized list of the device

  $ ethtool -c ethx
  ...
  rx-profile:
  {.usec = 1,   .pkts = 256, .comps = n/a,},
  {.usec = 8,   .pkts = 256, .comps = n/a,},
  {.usec = 64,  .pkts = 256, .comps = n/a,},
  {.usec = 128, .pkts = 256, .comps = n/a,},
  {.usec = 256, .pkts = 256, .comps = n/a,}
  tx-profile:   n/a

2. Tune

  $ ethtool -C ethx rx-profile 1,1,n_2,n,n_3,3,n_4,4,n_n,5,n

"n" means do not modify this field.

  $ ethtool -c ethx
  ...
  rx-profile:
  {.usec = 1,   .pkts = 1,   .comps = n/a,},
  {.usec = 2,   .pkts = 256, .comps = n/a,},
  {.usec = 3,   .pkts = 3,   .comps = n/a,},
  {.usec = 4,   .pkts = 4,   .comps = n/a,},
  {.usec = 256, .pkts = 5,   .comps = n/a,}
  tx-profile:   n/a

3. Hint

If the device does not support some type of customized dim profiles, the corresponding "n/a" is displayed. If an "n/a" field is being modified, -EOPNOTSUPP is reported.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240621101353.107425-4-hengqi@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
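A rough sketch of the capability check behind the "Hint" above (not the actual ethtool core code; DIM_PROFILE_RX is quoted from the text and DIM_PROFILE_TX is assumed to exist by symmetry):

  /* Reject requests that touch a profile list the device did not declare
   * support for; such a list shows up as "n/a" in ethtool -c.
   */
  static int check_profile_request(unsigned long profile_flags, bool is_rx)
  {
          if (is_rx && !(profile_flags & DIM_PROFILE_RX))
                  return -EOPNOTSUPP;
          if (!is_rx && !(profile_flags & DIM_PROFILE_TX))
                  return -EOPNOTSUPP;
          return 0;
  }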
-
Heng Qi authored
DIMLIB's capabilities are supplied by the dim, net_dim, and rdma_dim objects, and dim's interfaces solely act as a base for net_dim and rdma_dim and are not explicitly used anywhere else. rdma_dim is utilized by the infiniband driver, while net_dim is for network devices, excluding the soc/fsl driver.

In this patch, net_dim relies on some of NET's interfaces, thus DIMLIB needs to explicitly depend on the NET Kconfig. The soc/fsl driver uses the functions provided by net_dim, so it also needs to depend on NET.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240621101353.107425-3-hengqi@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Heng Qi authored
Useful macros are made available so they can be used effectively elsewhere; they will be utilized in subsequent patches.

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240621101353.107425-2-hengqi@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Geert Uytterhoeven says:

====================
ravb: Add MII support for R-Car V4M

All EtherAVB instances on R-Car Gen3/Gen4 SoCs support the RGMII interface. In addition, the first two EtherAVB instances on R-Car V4M also support the MII interface, but this is not yet supported by the driver.

This patch series adds support for MII on R-Car Gen4, after the customary cleanup. The corresponding pin control support is available in [1].

Compile-tested only, as all AVB interfaces on the Gray Hawk Single development board are connected to RGMII PHYs. No regressions on R-Car V4H.

[1] "[PATCH/RFC] pinctrl: renesas: r8a779h0: Add AVB MII pins and groups"
    https://lore.kernel.org/4a0a12227f2145ef53b18bc08f45b19dcd745fc6.1718378739.git.geert+renesas@glider.be/

v1: https://lore.kernel.org/f0ef3e00aec461beb33869ab69ccb44a23d78f51.1718378166.git.geert+renesas@glider.be
====================

Link: https://patch.msgid.link/cover.1719234830.git.geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Geert Uytterhoeven authored
All EtherAVB instances on R-Car Gen3/Gen4 SoCs support the RGMII interface. In addition, the first two EtherAVB instances on R-Car V4M also support the MII interface, but this is not yet supported by the driver.

Add support for MII on R-Car Gen4 by adding an R-Car Gen4-specific EMAC initialization function that selects the MII clock instead of the RGMII clock when the PHY interface is MII.

Note that all implementations of EtherAVB on R-Car Gen4 SoCs have the APSR register, but only MII-capable instances are documented to have the MIISELECT bit, which has a documented value of zero when reserved.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://patch.msgid.link/3a21d1d6680864aa85afff9260234c2b8054020a.1719234830.git.geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
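A minimal sketch of the change described above; APSR and ravb_modify() exist in the ravb driver, but the APSR_MIISELECT bit name and the function body here are assumptions based on the commit text, not code copied from the patch:

  static void ravb_emac_init_gen4_sketch(struct net_device *ndev)
  {
          struct ravb_private *priv = netdev_priv(ndev);

          /* Select the MII clock instead of the RGMII clock when the PHY
           * interface is MII; on non-MII-capable instances the bit is
           * reserved and documented as zero.
           */
          if (priv->phy_interface == PHY_INTERFACE_MODE_MII)
                  ravb_modify(ndev, APSR, APSR_MIISELECT, APSR_MIISELECT);
  }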
-
Geert Uytterhoeven authored
Move ravb_gen2_hw_info before ravb_gen3_hw_info to match ravb_match_table[] order.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://patch.msgid.link/a76febe3737e26365a784e9193da9363f22aa550.1719234830.git.geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 25 Jun, 2024 21 commits
-
-
Li RongQing authored
This place is fetching the stats, so u64_stats_update_begin()/end() should not be used here. Moreover, the fetcher of the stats is in the same context as the updater of the stats, so no protection is needed at all.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/20240621094552.53469-1-lirongqing@baidu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
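For context, a generic sketch of the u64_stats pattern being referred to (not virtio-net code): u64_stats_update_begin()/end() mark the writer side, a reader in a different context uses the fetch/retry pair, and neither is needed when the reader runs in the same context as the updater, as in the case above.

  #include <linux/u64_stats_sync.h>

  struct demo_stats {
          u64_stats_t             packets;
          struct u64_stats_sync   syncp;
  };

  static void demo_update(struct demo_stats *s)
  {
          u64_stats_update_begin(&s->syncp);      /* writer side only */
          u64_stats_inc(&s->packets);
          u64_stats_update_end(&s->syncp);
  }

  static u64 demo_fetch(const struct demo_stats *s)
  {
          unsigned int start;
          u64 packets;

          do {                    /* reader running in another context */
                  start = u64_stats_fetch_begin(&s->syncp);
                  packets = u64_stats_read(&s->packets);
          } while (u64_stats_fetch_retry(&s->syncp, start));

          return packets;
  }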
-
Yujie Liu authored
It seems that there is no definition for config IP_GRE, and it is not a dependency of other configs, so remove it.

  linux$ find -name Kconfig | xargs grep "IP_GRE"   <-- nothing

There is an IPV6_GRE config defined in net/ipv6/Kconfig. It only depends on NET_IPGRE_DEMUX but not IP_GRE.

Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20240624055539.2092322-1-yujie.liu@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
James Chapman authored
This fixes a sparse warning.

Fixes: d18d3f0a ("l2tp: replace hlist with simple list for per-tunnel session list")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202406220754.evK8Hrjw-lkp@intel.com/
Signed-off-by: James Chapman <jchapman@katalix.com>
Link: https://patch.msgid.link/20240624082945.1925009-1-jchapman@katalix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Elad Yifee authored
Introduce an additional validation to ensure that the PPE index is modified exclusively for mtk_eth ingress devices. This primarily addresses the issue related to WED operation with multiple PPEs.

Fixes: dee4dd10 ("net: ethernet: mtk_eth_soc: ppe: add support for multiple PPEs")
Signed-off-by: Elad Yifee <eladwf@gmail.com>
Link: https://lore.kernel.org/r/20240623175113.24437-1-eladwf@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Paolo Abeni authored
Vineeth Karumanchi says:

====================
net: macb: WOL enhancements

- Add provisioning for queue tie-off and queue disable during suspend.
- Add support for ARP packet types to WoL.
- Advertise WoL attributes by default.
- Extend MACB supported WoL modes to the PHY supported WoL modes.
- Deprecate magic-packet property.

v6: https://lore.kernel.org/netdev/20240617070413.2291511-1-vineeth.karumanchi@amd.com/
v5: https://lore.kernel.org/netdev/20240611162827.887162-1-vineeth.karumanchi@amd.com/
v4: https://lore.kernel.org/lkml/20240610053936.622237-1-vineeth.karumanchi@amd.com/
v3: https://lore.kernel.org/netdev/20240605102457.4050539-1-vineeth.karumanchi@amd.com/
v2: https://lore.kernel.org/netdev/20240222153848.2374782-1-vineeth.karumanchi@amd.com/
v1: https://lore.kernel.org/lkml/20240130104845.3995341-1-vineeth.karumanchi@amd.com/#t
====================

Link: https://lore.kernel.org/r/20240621045735.3031357-1-vineeth.karumanchi@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Vineeth Karumanchi authored
WOL modes such as magic-packet should be an OS policy. By default, advertise supported modes and use ethtool to activate the required mode. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Vineeth Karumanchi authored
Extend wake-on-LAN support with ARP packets. Currently, if the PHY supports WOL, ethtool ignores the modes supported by MACB. This change extends the WOL modes with the MACB-supported modes.

Advertise the wake-on-LAN supported modes by default without relying on the dt node. By default, wake-on-LAN is disabled. Using ethtool, users can enable/disable it or choose packet types.

For wake-on-LAN via ARP, ensure the IP address is assigned and report an error otherwise.

Co-developed-by: Harini Katakam <harini.katakam@amd.com>
Signed-off-by: Harini Katakam <harini.katakam@amd.com>
Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Tested-by: Claudiu Beznea <claudiu.beznea@tuxon.dev> # on SAMA7G5
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Vineeth Karumanchi authored
Enable queue disable for Versal devices. Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Vineeth Karumanchi authored
When GEM is used as a wake device, it is not mandatory for the RX DMA to be active. The RX engine in the IP only needs to receive and identify a wake packet through an interrupt. The wake packet is of no further significance; hence, it is not required to be copied into memory. By disabling RX DMA during suspend, we can avoid unnecessary DMA processing of any incoming traffic.

During suspend, perform either of the below operations:

 - tie-off/dummy descriptor: Disable unused queues by connecting them to a looped descriptor chain without free slots.

 - queue disable: The newer IP version allows disabling individual queues.

Co-developed-by: Harini Katakam <harini.katakam@amd.com>
Signed-off-by: Harini Katakam <harini.katakam@amd.com>
Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Tested-by: Claudiu Beznea <claudiu.beznea@tuxon.dev> # on SAMA7G5
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
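A simplified sketch of the tie-off idea (the descriptor layout and bit names here are illustrative, not the exact macb definitions): the unused queue is pointed at a single looped descriptor that is already marked used, so the RX DMA never finds a free slot.

  #include <linux/bits.h>
  #include <linux/types.h>

  struct rx_desc {
          u32 addr;               /* buffer address plus status bits */
          u32 ctrl;
  };

  #define RX_DESC_USED    BIT(0)  /* no free slot for the DMA */
  #define RX_DESC_WRAP    BIT(1)  /* last descriptor, ring wraps here */

  static void tie_off_queue(struct rx_desc *dummy)
  {
          /* One-entry ring with no buffer and no free slot; the queue's
           * RX buffer queue pointer register is then pointed at the DMA
           * address of this descriptor while suspended.
           */
          dummy->addr = RX_DESC_USED | RX_DESC_WRAP;
          dummy->ctrl = 0;
  }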
-
Paolo Abeni authored
Kuniyuki Iwashima says:

====================
af_unix: Remove spin_lock_nested() and convert to lock_cmp_fn.

This series removes spin_lock_nested() in AF_UNIX and instead defines the locking orders as functions tied to each lock by lockdep_set_lock_cmp_fn().

When the defined function returns a negative value, lockdep considers it will not cause deadlock. (See ->cmp_fn() in check_deadlock() and check_prev_add().)

When we cannot define the total ordering, we return -1 for the allowed ordering and otherwise 0 as undefined. [0]

[0]: https://lore.kernel.org/netdev/thzkgbuwuo3knevpipu4rzsh5qgmwhklihypdgziiruabvh46f@uwdkpcfxgloo/

Changes:
v4:
  * Patch 4
    * Make unix_state_lock_cmp_fn() symmetric.

v3: https://lore.kernel.org/netdev/20240614200715.93150-1-kuniyu@amazon.com/
  * Patch 3
    * Cache sk->sk_state
    * s/unix_state_lock()/unix_state_unlock()/
  * Patch 8
    * Add embryo -> listener locking order

v2: https://lore.kernel.org/netdev/20240611222905.34695-1-kuniyu@amazon.com/
  * Patch 1 & 2
    * Use (((l) > (r)) - ((l) < (r))) for comparison

v1: https://lore.kernel.org/netdev/20240610223501.73191-1-kuniyu@amazon.com/
====================

Link: https://lore.kernel.org/r/20240620205623.60139-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Kuniyuki Iwashima authored
When an (AF_UNIX, SOCK_STREAM) socket connect()s to a listening socket, the listener's sk_peer_pid/sk_peer_cred are copied to the client in copy_peercred().

Then, two sk_peer_locks are held there; one is the client's and the other is the listener's. However, the latter is not needed because we hold the listener's unix_state_lock() there and unix_listen() cannot update the cred concurrently.

Let's drop the unnecessary spin_lock() and use the bare spin_lock() for the client to protect against a concurrent read by getsockopt(SO_PEERCRED).

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Kuniyuki Iwashima authored
When (AF_UNIX, SOCK_STREAM) socket connect()s to a listening socket, the listener's sk_peer_pid/sk_peer_cred are copied to the client in copy_peercred(). Then, the client's sk_peer_pid and sk_peer_cred are always NULL, so we need not call put_pid() and put_cred() there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Kuniyuki Iwashima authored
init_peercred() is called in 3 places:

  1. socketpair() : both sockets
  2. connect()    : child socket
  3. listen()     : listening socket

The first two need not hold sk_peer_lock because no one can touch the socket. Let's set cred/pid without holding the lock for the two cases and rename the old init_peercred() to update_peercred() to properly reflect the use case.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
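A sketch of the resulting split; the function names follow the commit text, while the bodies below are illustrative rather than the exact af_unix code:

  static void init_peercred(struct sock *sk)
  {
          /* Cases 1 and 2: the socket is not yet visible to anyone else
           * and the old pid/cred are known to be NULL, so no lock and no
           * put_pid()/put_cred().
           */
          sk->sk_peer_pid  = get_pid(task_tgid(current));
          sk->sk_peer_cred = get_current_cred();
  }

  static void update_peercred(struct sock *sk)
  {
          const struct cred *old_cred;
          struct pid *old_pid;

          /* Case 3: listen() on an already visible socket, so protect
           * against a concurrent getsockopt(SO_PEERCRED) reader.
           */
          spin_lock(&sk->sk_peer_lock);
          old_pid  = sk->sk_peer_pid;
          old_cred = sk->sk_peer_cred;
          sk->sk_peer_pid  = get_pid(task_tgid(current));
          sk->sk_peer_cred = get_current_cred();
          spin_unlock(&sk->sk_peer_lock);

          put_pid(old_pid);
          put_cred(old_cred);
  }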
-
Kuniyuki Iwashima authored
While GC is cleaning up cyclic references by SCM_RIGHTS, unix_collect_skb() collects skb in the socket's recvq. If the socket is TCP_LISTEN, we need to collect skb in the embryo's queue. Then, both the listener's recvq lock and the embryo's one are held.

The locking is always done in the listener -> embryo order. Let's define it as unix_recvq_lock_cmp_fn() instead of using spin_lock_nested().

Note that the reverse order is defined for consistency.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Kuniyuki Iwashima authored
Commit 1971d13f ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().") added U_LOCK_GC_LISTENER for the old GC, but it's no longer needed for the new GC. Let's remove U_LOCK_GC_LISTENER and unix_state_lock_nested() as there's no user. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Kuniyuki Iwashima authored
sk_diag_dump_icons() acquires embryo's lock by unix_state_lock_nested() to fetch its peer. The embryo's ->peer is set to NULL only when its parent listener is close()d. Then, unix_release_sock() is called for each embryo after unlinking skb by skb_dequeue(). In sk_diag_dump_icons(), we hold the parent's recvq lock, so we need not acquire unix_state_lock_nested(), and peer is always non-NULL. Let's remove unnecessary unix_state_lock_nested() and non-NULL test for peer. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Kuniyuki Iwashima authored
sk_diag_dump_peer() and sk_diag_dump() call unix_state_lock() for sock_i_ino() which reads SOCK_INODE(sk->sk_socket)->i_ino, but it's protected by sk->sk_callback_lock. Let's remove unnecessary unix_state_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Kuniyuki Iwashima authored
While a SOCK_(STREAM|SEQPACKET) socket connect()s to another, we hold the two sockets' locks via unix_state_lock() and unix_state_lock_nested() in unix_stream_connect().

Before unix_state_lock_nested(), the following is guaranteed by checking sk->sk_state:

  1. The first socket is TCP_LISTEN
  2. The second socket is not the first one
  3. Simultaneous connect() must fail

So, the client state can be TCP_CLOSE or TCP_LISTEN or TCP_ESTABLISHED. Let's define the expected states as unix_state_lock_cmp_fn() instead of using unix_state_lock_nested().

Note that 2. is detected by debug_spin_lock_before() and 3. cannot be expressed as a lock_cmp_fn.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
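A sketch of what such a cmp_fn can look like; the exact upstream function may differ, and the container_of() through lock.dep_map and the state checks below illustrate the description rather than the patch itself. Returning a negative value tells lockdep the order "first argument before second" is allowed.

  static int unix_state_lock_cmp_fn(const struct lockdep_map *_a,
                                    const struct lockdep_map *_b)
  {
          const struct unix_sock *a, *b;

          a = container_of(_a, struct unix_sock, lock.dep_map);
          b = container_of(_b, struct unix_sock, lock.dep_map);

          /* unix_stream_connect(): listener -> client.  Kept symmetric so
           * that swapping the arguments flips the sign.
           */
          if (a->sk.sk_state == TCP_LISTEN && b->sk.sk_state != TCP_LISTEN)
                  return -1;
          if (a->sk.sk_state != TCP_LISTEN && b->sk.sk_state == TCP_LISTEN)
                  return 1;

          return 0;       /* otherwise the ordering is undefined */
  }

The function would be tied to each socket's lock with lockdep_set_lock_cmp_fn(), as described in the cover letter.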
-
Kuniyuki Iwashima authored
When a SOCK_(STREAM|SEQPACKET) socket connect()s to another one, we need to lock the two sockets to check their states in unix_stream_connect(). We use unix_state_lock() for the server and unix_state_lock_nested() for the client, with a tricky sk->sk_state check to avoid deadlock.

The possible deadlock scenarios are the following:

  1) Self connect()
  2) Simultaneous connect()

The former is simple: an attempt to grab the same lock twice. The latter is an AB-BA deadlock.

After the server's unix_state_lock(), we check the server socket's state, and if it's not TCP_LISTEN, connect() fails with -EINVAL.

Then, we avoid the former deadlock by checking the client's state before unix_state_lock_nested(). If its state is not TCP_LISTEN, we can make sure that the client and the server are not identical based on the state.

Also, the latter deadlock can be avoided in the same way. Due to the server sk->sk_state requirement, the AB-BA deadlock could happen only with TCP_LISTEN sockets. So, if the client's state is TCP_LISTEN, we can give up the second lock to avoid the deadlock.

  CPU 1                       CPU 2                         CPU 3
  connect(A -> B)             connect(B -> A)               listen(A)
  ---                         ---                           ---
  unix_state_lock(B)
  B->sk_state == TCP_LISTEN
  READ_ONCE(A->sk_state) == TCP_CLOSE
  ^^^^^^^^^
  ok, will lock A                                           unix_state_lock(A)
      .-----------------------------------------------------'
      |                                                     WRITE_ONCE(A->sk_state, TCP_LISTEN)
      |                                                     unix_state_unlock(A)
      |
      |                       unix_state_lock(A)
      |                       A->sk_state == TCP_LISTEN
      |                       READ_ONCE(B->sk_state) == TCP_LISTEN
      v                       ^^^^^^^^^^
  unix_state_lock_nested(A)   Don't lock B !!

Currently, while checking the client's state, we also check if it's TCP_ESTABLISHED, but this is unlikely and can be checked after we know the state is not TCP_CLOSE.

Moreover, if it happens after the second lock, we now jump to the restart label, but it's unlikely that the server is not found during the retry, so the jump is mostly to revisit the client state check.

Let's remove the retry logic and check the state against TCP_CLOSE first.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Kuniyuki Iwashima authored
unix_dgram_connect() and unix_dgram_{send,recv}msg() lock the socket and peer in ascending order of the socket address. Let's define the order as unix_state_lock_cmp_fn() instead of using unix_state_lock_nested(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Kuniyuki Iwashima authored
When created, an AF_UNIX socket is put into net->unx.table.buckets[], and the hash is stored in sk->sk_hash.

  * unbound socket  : 0 <= sk_hash <= UNIX_HASH_MOD

When bind() is called, the socket could be moved to another bucket.

  * pathname socket : 0 <= sk_hash <= UNIX_HASH_MOD
  * abstract socket : UNIX_HASH_MOD + 1 <= sk_hash <= UNIX_HASH_MOD * 2 + 1

Then, we call unix_table_double_lock() which locks a single bucket or two.

Let's define the order as unix_table_lock_cmp_fn() instead of using spin_lock_nested().

The locking is always done in ascending order of sk->sk_hash, which is the index of the buckets/locks array allocated by kvmalloc_array().

  sk_hash_A < sk_hash_B
  <=> &locks[sk_hash_A].dep_map < &locks[sk_hash_B].dep_map

So, the relation of two sk->sk_hash can be derived from the addresses of dep_map in the array of locks.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
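A sketch of the comparison described above, using the (((l) > (r)) - ((l) < (r))) idiom mentioned in the cover letter (illustrative, not the verbatim patch):

  static int unix_table_lock_cmp_fn(const struct lockdep_map *a,
                                    const struct lockdep_map *b)
  {
          /* The bucket locks live in one kvmalloc_array()'d array, so the
           * address order of their dep_maps is the sk_hash order:
           * -1 = ascending (allowed), 0 = same bucket, 1 = descending.
           */
          return (a > b) - (a < b);
  }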
-
- 24 Jun, 2024 10 commits
-
-
Jakub Kicinski authored
Sebastian Andrzej Siewior says:

====================
locking: Introduce nested-BH locking.

Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within a local_bh_disable() section remains preemptible. As a result, high-priority tasks (or threaded interrupts) will be blocked by long-running lower-priority tasks (or threaded interrupts), which includes softirq sections.

The proposed way out is to introduce explicit per-CPU locks for resources which are protected by local_bh_disable() and use those only on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds.

The series introduces the infrastructure and converts large parts of networking, which is the largest stakeholder here. Once this is done the per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted.

Performance testing. Baseline is net-next as of commit 93bda330 ("Merge branch 'net-constify-ctl_table-arguments-of-utility-functions'") plus v6.10-rc1. A 10GiG link is used between two hosts. The command

  xdp-bench redirect-cpu --cpu 3 --remote-action drop eth1 -e

was invoked on the receiving side with an ixgbe. The sending side uses pktgen_sample03_burst_single_flow.sh on i40e.

Baseline:
| eth1->? 9,018,604 rx/s 0 err,drop/s
| receive total 9,018,604 pkt/s 0 drop/s 0 error/s
| cpu:7 9,018,604 pkt/s 0 drop/s 0 error/s
| enqueue to cpu 3 9,018,602 pkt/s 0 drop/s 7.00 bulk-avg
| cpu:7->3 9,018,602 pkt/s 0 drop/s 7.00 bulk-avg
| kthread total 9,018,606 pkt/s 0 drop/s 214,698 sched
| cpu:3 9,018,606 pkt/s 0 drop/s 214,698 sched
| xdp_stats 0 pass/s 9,018,606 drop/s 0 redir/s
| cpu:3 0 pass/s 9,018,606 drop/s 0 redir/s
| redirect_err 0 error/s
| xdp_exception 0 hit/s

perf top --sort cpu,symbol --no-children:
| 18.14% 007 [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
| 13.29% 007 [k] ixgbe_poll
| 12.66% 003 [k] cpu_map_kthread_run
|  7.23% 003 [k] page_frag_free
|  6.76% 007 [k] xdp_do_redirect
|  3.76% 007 [k] cpu_map_redirect
|  3.13% 007 [k] bq_flush_to_queue
|  2.51% 003 [k] xdp_return_frame
|  1.93% 007 [k] try_to_wake_up
|  1.78% 007 [k] _raw_spin_lock
|  1.74% 007 [k] cpu_map_enqueue
|  1.56% 003 [k] bpf_prog_57cd311f2e27366b_cpumap_drop

With this series applied:
| eth1->? 10,329,340 rx/s 0 err,drop/s
| receive total 10,329,340 pkt/s 0 drop/s 0 error/s
| cpu:6 10,329,340 pkt/s 0 drop/s 0 error/s
| enqueue to cpu 3 10,329,338 pkt/s 0 drop/s 8.00 bulk-avg
| cpu:6->3 10,329,338 pkt/s 0 drop/s 8.00 bulk-avg
| kthread total 10,329,321 pkt/s 0 drop/s 96,297 sched
| cpu:3 10,329,321 pkt/s 0 drop/s 96,297 sched
| xdp_stats 0 pass/s 10,329,321 drop/s 0 redir/s
| cpu:3 0 pass/s 10,329,321 drop/s 0 redir/s
| redirect_err 0 error/s
| xdp_exception 0 hit/s

perf top --sort cpu,symbol --no-children:
| 20.90% 006 [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
| 12.62% 006 [k] ixgbe_poll
|  9.82% 003 [k] page_frag_free
|  8.73% 003 [k] cpu_map_bpf_prog_run_xdp
|  6.63% 006 [k] xdp_do_redirect
|  4.94% 003 [k] cpu_map_kthread_run
|  4.28% 006 [k] cpu_map_redirect
|  4.03% 006 [k] bq_flush_to_queue
|  3.01% 003 [k] xdp_return_frame
|  1.95% 006 [k] _raw_spin_lock
|  1.94% 003 [k] bpf_prog_57cd311f2e27366b_cpumap_drop

This diff appears to be noise.

v8: https://lore.kernel.org/all/20240619072253.504963-1-bigeasy@linutronix.de
v7: https://lore.kernel.org/all/20240618072526.379909-1-bigeasy@linutronix.de
v6: https://lore.kernel.org/all/20240612170303.3896084-1-bigeasy@linutronix.de
v5: https://lore.kernel.org/all/20240607070427.1379327-1-bigeasy@linutronix.de
v4: https://lore.kernel.org/all/20240604154425.878636-1-bigeasy@linutronix.de
v3: https://lore.kernel.org/all/20240529162927.403425-1-bigeasy@linutronix.de
v2: https://lore.kernel.org/all/20240503182957.1042122-1-bigeasy@linutronix.de
v1: https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de
====================

Link: https://patch.msgid.link/20240620132727.660738-1-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Sebastian Andrzej Siewior authored
The per-CPU flush lists are accessed from within the NAPI callback (xdp_do_flush() for instance). They are subject to the same problem as struct bpf_redirect_info.

Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and xskmap_map_flush_list to struct bpf_net_context. Add wrappers for the access. The lists are initialized on first usage (similar to bpf_net_ctx_get_ri()).

Cc: "Björn Töpel" <bjorn@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-16-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Sebastian Andrzej Siewior authored
The XDP redirect process is two staged:

- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the packet and makes decisions. While doing that, the per-CPU variable bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info and it may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which does not access bpf_redirect_info but will touch the individual per-CPU lists.

The per-CPU variables are only used in the NAPI callback, hence disabling bottom halves is the only protection mechanism. Users from preemptible context (like cpu_map_kthread_run()) explicitly disable bottom halves for protection reasons. Without locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking.

PREEMPT_RT has forced-threaded interrupts enabled and every NAPI callback runs in a thread. If each thread has its own data structure then locking can be avoided.

Create a struct bpf_net_context which contains struct bpf_redirect_info. Define the variable on the stack, use bpf_net_ctx_set() to save a pointer to it, and bpf_net_ctx_clear() to remove it again. bpf_net_ctx_set() may nest. For instance, a function can be used from within NET_RX_SOFTIRQ/net_rx_action, which uses bpf_net_ctx_set(), and from NET_TX_SOFTIRQ, which does not. Therefore only the first invocation updates the pointer.

Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct bpf_redirect_info. The returned data structure is zero initialized to ensure nothing is leaked from the stack. This is done on first usage of the struct. bpf_net_ctx_set() sets bpf_redirect_info::kern_flags to 0 to note that initialisation is required. The first invocation of bpf_net_ctx_get_ri() will memset() the data structure and update bpf_redirect_info::kern_flags. bpf_redirect_info::nh is excluded from the memset because it is only used once BPF_F_NEIGH is set, which also sets the nh member. The kern_flags is moved past nh to exclude it from the memset.

The pointer to bpf_net_context is saved in the task's task_struct. Always using the bpf_net_context approach has the advantage that there is almost zero difference between PREEMPT_RT and non-PREEMPT_RT builds.

Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-15-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
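A condensed sketch of the scheme just described; this is simplified, and the flag name, the task_struct member name and the omission of the flush lists are assumptions, so refer to the patch for the real definitions:

  struct bpf_net_context {
          struct bpf_redirect_info ri;
          /* per-CPU flush lists are added here by a later patch */
  };

  static inline struct bpf_net_context *
  bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
  {
          struct task_struct *tsk = current;

          if (tsk->bpf_net_context)               /* nested: keep the outer one */
                  return NULL;
          bpf_net_ctx->ri.kern_flags = 0;         /* mark "needs init" */
          tsk->bpf_net_context = bpf_net_ctx;
          return bpf_net_ctx;
  }

  static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
  {
          if (bpf_net_ctx)
                  current->bpf_net_context = NULL;
  }

  static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
  {
          struct bpf_net_context *ctx = current->bpf_net_context;

          if (!(ctx->ri.kern_flags & BPF_RI_F_INIT_DONE)) {       /* flag name assumed */
                  /* Zero everything up to ->nh: ->nh is only valid once
                   * BPF_F_NEIGH is set, and ->kern_flags sits after it.
                   */
                  memset(&ctx->ri, 0, offsetof(struct bpf_redirect_info, nh));
                  ctx->ri.kern_flags |= BPF_RI_F_INIT_DONE;
          }
          return &ctx->ri;
  }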
-
Sebastian Andrzej Siewior authored
bpf_scratchpad is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-14-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
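The general shape of this kind of conversion (struct and variable names here are illustrative, not the real bpf_scratchpad layout; local_lock_nested_bh()/local_unlock_nested_bh() are the helpers introduced earlier in the series):

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  struct pcpu_scratch {
          local_lock_t    bh_lock;
          u32             buf[64];
  };

  static DEFINE_PER_CPU(struct pcpu_scratch, pcpu_scratch) = {
          .bh_lock = INIT_LOCAL_LOCK(bh_lock),
  };

  static void use_scratch(void)
  {
          /* The caller already runs with BH disabled; on !PREEMPT_RT this
           * only adds lockdep annotations, on PREEMPT_RT it is a real
           * per-CPU lock.
           */
          local_lock_nested_bh(&pcpu_scratch.bh_lock);
          this_cpu_ptr(&pcpu_scratch)->buf[0] = 0;
          local_unlock_nested_bh(&pcpu_scratch.bh_lock);
  }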
-
Sebastian Andrzej Siewior authored
The access to seg6_bpf_srh_states is protected by disabling preemption. Based on the code, the entry point is input_action_end_bpf() and every other function (the bpf helper functions bpf_lwt_seg6_*()) that is accessing seg6_bpf_srh_states should be called from within input_action_end_bpf().

input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of the function and then disables preemption. This looks wrong because if preemption needs to be disabled as part of the locking mechanism then the variable shouldn't be accessed beforehand.

Looking at how it is used via test_lwt_seg6local.sh, input_action_end_bpf() is always invoked from softirq context. If this is always the case then the preempt_disable() statement is superfluous. If this is not always invoked from softirq then disabling only preemption is not sufficient.

Replace the preempt_disable() statement with nested-BH locking. This is not an equivalent replacement as it assumes that the invocation of input_action_end_bpf() always occurs in softirq context and thus the preempt_disable() is superfluous.

Add a local_lock_t to the data structure and use local_lock_nested_bh() for locking. Add lockdep_assert_held() to ensure the lock is held while the per-CPU variable is referenced in the helper functions.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-13-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Sebastian Andrzej Siewior authored
There is no need to explicitly disable migration if bottom halves are also disabled. Disabling BH implies disabling migration. Remove migrate_disable() and rely solely on disabling BH to remain on the same CPU.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-12-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Sebastian Andrzej Siewior authored
softnet_data::process_queue is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking.

softnet_data::input_queue_head can be updated locklessly. This is fine because this value is only updated CPU-locally by the local backlog_napi thread.

Add a local_lock_t to softnet_data and use local_lock_nested_bh() for locking of process_queue. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-11-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Sebastian Andrzej Siewior authored
The backlog_napi locking (previously RPS) relies on explicit locking if either RPS or backlog NAPI is enabled. If both are disabled then locking was achieved by disabling interrupts, except on PREEMPT_RT. PREEMPT_RT was excluded because the needed synchronisation was already provided by local_bh_disable().

Since the introduction of backlog NAPI and making it mandatory for PREEMPT_RT, the ifdef within backlog_lock.*() is obsolete and can be removed.

Remove the ifdefs in backlog_lock.*().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-10-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Sebastian Andrzej Siewior authored
Softirq is preemptible on PREEMPT_RT. Without a per-CPU lock in local_bh_disable() there is no guarantee that only one device is transmitting at a time. With preemption and multiple senders it is possible that the per-CPU `recursion' counter gets incremented by different threads and exceeds XMIT_RECURSION_LIMIT, leading to a false positive recursion alert. The `more' member is subject to similar problems if set by one thread for one driver and wrongly used by another driver within another thread.

Instead of adding a lock to protect the per-CPU variable it is simpler to make xmit per-task. Sending and receiving skbs always happens in thread context anyway.

Having a lock to protect the per-CPU counter would block/serialize two sending threads needlessly. It would also require a recursive lock to ensure that the owner can increment the counter further.

Make softnet_data.xmit a task_struct member on PREEMPT_RT. Add the needed wrapper.

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-9-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
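A sketch of the wrapper idea; the struct and member names below are illustrative assumptions, not the patch's actual definitions:

  #ifdef CONFIG_PREEMPT_RT
  static inline struct netdev_xmit *netdev_xmit_ctx(void)
  {
          return &current->net_xmit;              /* per-task on PREEMPT_RT */
  }
  #else
  static inline struct netdev_xmit *netdev_xmit_ctx(void)
  {
          return this_cpu_ptr(&softnet_data.xmit);        /* per-CPU otherwise */
  }
  #endif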
-
Sebastian Andrzej Siewior authored
brnf_frag_data_storage is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking.

Add a local_lock_t to the data structure and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT.

Cc: Florian Westphal <fw@strlen.de>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: bridge@lists.linux.dev
Cc: coreteam@netfilter.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-8-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-