Commits · a7b32744475cb774b4fce599f58394f4ef0acb68 · Kirill Smelkov / linux

11 Aug, 2024 19 commits

net: mvpp2: use device_for_each_child_node() to access device child nodes · a7b32744

Javier Carrasco authored Aug 08, 2024

The iterated nodes are direct children of the device node, and the
`device_for_each_child_node()` macro accounts for child node
availability.

`fwnode_for_each_available_child_node()` is meant to access the child
nodes of an fwnode, and therefore not direct child nodes of the device
node.

The child nodes within mvpp2_probe are not accessed outside the loops,
and the scoped version of the macro can be used to automatically
decrement the refcount on early exits.

Use `device_for_each_child_node()` and its scoped variant to indicate
device's direct child nodes.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a7b32744

net: mvpp2: use port_count to remove ports · e81d00a6

Javier Carrasco authored Aug 08, 2024

As discussed in [1], there is no need to iterate over child nodes to
remove the list of ports. Instead, a loop up to `port_count` ports can
be used, and is in fact more reliable in case the child node
availability changes.

The suggested approach removes the need for the `fwnode` and
`port_fwnode` variables in mvpp2_remove() as well.

Link: https://lore.kernel.org/all/ZqdRgDkK1PzoI2Pf@shell.armlinux.org.uk/ [1]
Suggested-by: Russell King <linux@armlinux.org.uk>
Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e81d00a6

Merge branch 'bnxt_en-fix-queue-reset-when-queue-active' · 80d021bc

David S. Miller authored Aug 11, 2024

David Wei says:

====================
fix bnxt_en queue reset when queue is active

The current bnxt_en queue API implementation is buggy when resetting a
queue that has active traffic. The problem is that there is no FW
involved to stop the flow of packets and relying on napi_disable() isn't
enough.

To fix this, call bnxt_hwrm_vnic_update() with MRU set to 0 for both the
default and the ntuple vnic to stop the flow of packets. This works for
any Rx queue and not only those that have ntuple rules since every Rx
queue is either in the default or the ntuple vnic.

For bnxt_hwrm_vnic_update() to work, proper flushing must be done by the
FW. A FW flag is there to indicate support and queue_mgmt_ops is keyed
behind this.

The first three patches are from Michael Chan and adds the prerequisite
vnic functions and FW flags indicating that it will properly flush
during vnic update.

Tested on BCM957504 while iperf3 is active:

1. Reset a queue that has an ntuple rule steering flow into it
2. Reset all queues in order, one at a time

In both cases the flow is not interrupted.

Sending this to net-next as there is no in-tree kernel consumer of queue
API just yet, and there is a patch that changes when the queue_mgmt_ops
is registered.
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
---
v3:
 - include patches from Michael Chan that adds a FW flag for vnic flush
   capability
 - key support for queue_mgmt_ops behind this new flag

v2:
 - split setting vnic->mru into a separate patch (Wojciech)
 - clarify why napi_enable()/disable() is removed
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

80d021bc

bnxt_en: only set dev->queue_mgmt_ops if supported by FW · 97cbf3d0

David Wei authored Aug 07, 2024

The queue API calls bnxt_hwrm_vnic_update() to stop/start the flow of
packets, which can only properly flush the pipeline if FW indicates
support.

Add a macro BNXT_SUPPORTS_QUEUE_API that checks for the required flags
and only set queue_mgmt_ops if true.
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

97cbf3d0

bnxt_en: stop packet flow during bnxt_queue_stop/start · b9d2956e

David Wei authored Aug 07, 2024

The current implementation when resetting a queue while packets are
flowing puts the queue into an inconsistent state.

There needs to be some synchronisation with the FW. Add calls to
bnxt_hwrm_vnic_update() to set the MRU for both the default and ntuple
vnic during queue start/stop. When the MRU is set to 0, flow is stopped.
Each Rx queue belongs to either the default or the ntuple vnic.

With calling bnxt_hwrm_vnic_update() the calls to napi_enable() and
napi_disable() must be removed for reset to work on a queue that has
active traffic flowing e.g. iperf3.
Co-developed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

b9d2956e

bnxt_en: set vnic->mru in bnxt_hwrm_vnic_cfg() · d41575f7

David Wei authored Aug 07, 2024

Set the newly added vnic->mru field in bnxt_hwrm_vnic_cfg().
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

d41575f7

bnxt_en: Check the FW's VNIC flush capability · 6e360862

Michael Chan authored Aug 07, 2024

Check the HWRM_VNIC_QCAPS FW response for the receive engine flush
capability.  This capability indicates that we can reliably support
RX ring restart when calling HWRM_VNIC_UPDATE with MRU set to 0.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

6e360862

bnxt_en: Add support to call FW to update a VNIC · f2878cde

Michael Chan authored Aug 07, 2024

Add the function bnxt_hwrm_vnic_update() to call FW to update
a VNIC.  This call can be used when disabling and enabling a
receive ring within a VNIC.  The mru which is the maximum receive
size of packets received by the VNIC can be updated.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

f2878cde

bnxt_en: Update firmware interface to 1.10.3.68 · fbda8ee6

Michael Chan authored Aug 07, 2024

The main changes are:

1. HWRM_VNIC_UPDATE used to safely disable and enable an RX ring within
the VNIC.

2. New flag in HWRM_VNIC_QCAPS to indicate FW will do the proper flush
during HWRM_VNIC_UPDATE.

3. New flag in HWRM_FUNC_QCAPS to indicate that reservations for some
resources such as VNIC can be reduced.

4. New backing store memory types not used by the driver yet.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

fbda8ee6

Merge branch 'l2tp-misc-improvements' · 969afb43

David S. Miller authored Aug 11, 2024

James Chapman says:

====================
l2tp: misc improvements

This series makes several improvements to l2tp:

 * update documentation to be consistent with recent l2tp changes.
 * move l2tp_ip socket tables to per-net data.
 * fix handling of hash key collisions in l2tp_v3_session_get
 * implement and use get-next APIs for management and procfs/debugfs.
 * improve l2tp refcount helpers.
 * use per-cpu dev->tstats in l2tpeth devices.
 * fix a lockdep splat.
 * fix a race between l2tp_pre_exit_net and pppol2tp_release.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

969afb43

l2tp: flush workqueue before draining it · c1b2e36b

James Chapman authored Aug 07, 2024

syzbot exposes a race where a net used by l2tp is removed while an
existing pppol2tp socket is closed. In l2tp_pre_exit_net, l2tp queues
TUNNEL_DELETE work items to close each tunnel in the net. When these
are run, new SESSION_DELETE work items are queued to delete each
session in the tunnel. This all happens in drain_workqueue. However,
drain_workqueue allows only new work items if they are queued by other
work items which are already in the queue. If pppol2tp_release runs
after drain_workqueue has started, it may queue a SESSION_DELETE work
item, which results in the warning below in drain_workqueue.

Address this by flushing the workqueue before drain_workqueue such
that all queued TUNNEL_DELETE work items run before drain_workqueue is
started. This will queue SESSION_DELETE work items for each session in
the tunnel, hence pppol2tp_release or other API requests won't queue
SESSION_DELETE requests once drain_workqueue is started.

  WARNING: CPU: 1 PID: 5467 at kernel/workqueue.c:2259 __queue_work+0xcd3/0xf50 kernel/workqueue.c:2258
  Modules linked in:
  CPU: 1 UID: 0 PID: 5467 Comm: syz.3.43 Not tainted 6.11.0-rc1-syzkaller-00247-g3608d6ac #0
  Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024
  RIP: 0010:__queue_work+0xcd3/0xf50 kernel/workqueue.c:2258
  Code: ff e8 11 84 36 00 90 0f 0b 90 e9 1e fd ff ff e8 03 84 36 00 eb 13 e8 fc 83 36 00 eb 0c e8 f5 83 36 00 eb 05 e8 ee 83 36 00 90 <0f> 0b 90 48 83 c4 60 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc
  RSP: 0018:ffffc90004607b48 EFLAGS: 00010093
  RAX: ffffffff815ce274 RBX: ffff8880661fda00 RCX: ffff8880661fda00
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: 0000000000000000 R08: ffffffff815cd6d4 R09: 0000000000000000
  R10: ffffc90004607c20 R11: fffff520008c0f85 R12: ffff88802ac33800
  R13: ffff88802ac339c0 R14: dffffc0000000000 R15: 0000000000000008
  FS:  00005555713eb500(0000) GS:ffff8880b9300000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000008 CR3: 000000001eda6000 CR4: 00000000003506f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   queue_work_on+0x1c2/0x380 kernel/workqueue.c:2392
   pppol2tp_release+0x163/0x230 net/l2tp/l2tp_ppp.c:445
   __sock_release net/socket.c:659 [inline]
   sock_close+0xbc/0x240 net/socket.c:1421
   __fput+0x24a/0x8a0 fs/file_table.c:422
   task_work_run+0x24f/0x310 kernel/task_work.c:228
   resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
   exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
   exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
   __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
   syscall_exit_to_user_mode+0x168/0x370 kernel/entry/common.c:218
   do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f061e9779f9
  Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007ffff1c1fce8 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4
  RAX: 0000000000000000 RBX: 000000000001017d RCX: 00007f061e9779f9
  RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003
  RBP: 00007ffff1c1fdc0 R08: 0000000000000001 R09: 00007ffff1c1ffcf
  R10: 00007f061e800000 R11: 0000000000000246 R12: 0000000000000032
  R13: 00007ffff1c1fde0 R14: 00007ffff1c1fe00 R15: ffffffffffffffff
  </TASK>

Fixes: fc7ec7f5 ("l2tp: delete sessions using work queue")
Reported-by: syzbot+0e85b10481d2f5478053@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0e85b10481d2f5478053Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c1b2e36b

l2tp: l2tp_eth: use per-cpu counters from dev->tstats · dcc59d3e

James Chapman authored Aug 07, 2024

l2tp_eth uses old-style dev->stats for fastpath packet/byte
counters. Convert it to use dev->tstats per-cpu counters.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dcc59d3e

l2tp: improve tunnel/session refcount helpers · abe7a1a7

James Chapman authored Aug 07, 2024

l2tp_tunnel_inc_refcount and l2tp_session_inc_refcount wrap
refcount_inc. They add no value so just use the refcount APIs directly
and drop l2tp's helpers. l2tp already uses refcount_inc_not_zero
anyway.

Rename l2tp_tunnel_dec_refcount and l2tp_session_dec_refcount to
l2tp_tunnel_put and l2tp_session_put to better match their use pairing
various _get getters.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

abe7a1a7

l2tp: use get_next APIs for management requests and procfs/debugfs · 1f4c3dce

James Chapman authored Aug 07, 2024

l2tp netlink and procfs/debugfs iterate over tunnel and session lists
to obtain data. They currently use very inefficient get_nth functions
to do so. Replace these with get_next.

For netlink, use nl cb->ctx[] for passing state instead of the
obsolete cb->args[].

l2tp_tunnel_get_nth and l2tp_session_get_nth are no longer used so
they can be removed.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f4c3dce

l2tp: add tunnel/session get_next helpers · aa92c1ce

James Chapman authored Aug 07, 2024

l2tp management APIs and procfs/debugfs iterate over l2tp tunnel and
session lists. Since these lists are now implemented using IDR, we can
use IDR get_next APIs to iterate them. Add tunnel/session get_next
functions to do so.

The session get_next functions get the next session in a given tunnel
and need to account for l2tpv2 and l2tpv3 differences:

 * l2tpv2 sessions are keyed by tunnel ID / session ID. Iteration for
   a given tunnel ID, TID, can therefore start with a key given by
   TID/0 and finish when the next entry's tunnel ID is not TID. This
   is possible only because the tunnel ID part of the key is the upper
   16 bits and the session ID part the lower 16 bits; when idr_next
   increments the key value, it therefore finds the next sessions of
   the current tunnel before those of the next tunnel. Entries with
   session ID 0 are always skipped because they are used internally by
   pppol2tp.

 * l2tpv3 sessions are keyed by session ID. Iteration starts at the
   first IDR entry and skips entries where the tunnel does not
   match. Iteration must also consider session ID collisions and walk
   the list of colliding sessions (if any) for one which matches the
   supplied tunnel.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aa92c1ce

l2tp: handle hash key collisions in l2tp_v3_session_get · b0a8deda

James Chapman authored Aug 07, 2024

To handle colliding l2tpv3 session IDs, l2tp_v3_session_get searches a
hashed list keyed by ID and sk. Although unlikely, if hash keys
collide, it is possible that hash_for_each_possible loops over a
session which doesn't have the ID that we are searching for. So check
for session ID match when looping over possible hash key matches.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b0a8deda

l2tp: move l2tp_ip and l2tp_ip6 data to pernet · ebed6606

James Chapman authored Aug 07, 2024

l2tp_ip[6] have always used global socket tables. It is therefore not
possible to create l2tpip sockets in different namespaces with the
same socket address.

To support this, move l2tpip socket tables to pernet data.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ebed6606

l2tp: remove inline from functions in c sources · 168464c1

James Chapman authored Aug 07, 2024

Update l2tp to remove the inline keyword from several functions in C
sources, since this is now discouraged.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

168464c1

documentation/networking: update l2tp docs · e2b1762c

James Chapman authored Aug 07, 2024

l2tp no longer uses sk_user_data in tunnel sockets and now manages
tunnel/session lifetimes slightly differently. Update docs to cover
this.

CC: linux-doc@vger.kernel.org
CC: corbet@lwn.net
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e2b1762c

10 Aug, 2024 21 commits

Merge branch 'mlx5-misc-patches-2024-08-08' · bbfeba26

Jakub Kicinski authored Aug 09, 2024

Tariq Toukan says:

====================
mlx5 misc patches 2024-08-08

This patchset contains multiple enhancements from the team to the mlx5
core and Eth drivers.

Patch #1 by Chris bumps a defined value to permit more devices doing TC
offloads.

Patch #2 by Jianbo adds an IPsec fast-path optimization to replace the
slow async handling.

Patches #3 and #4 by Jianbo add TC offload support for complicated rules
to overcome firmware limitation.

Patch #5 by Gal unifies the access macro to advertised/supported link
modes.

Patches #6 to #9 by Gal adds extack messages in ethtool ops to replace
prints to the kernel log.

Patch #10 by Cosmin switches to using 'update' verb instead of 'replace'
to better reflect the operation.

Patch #11 by Cosmin exposes an update connection tracking operation to
replace the assumed delete+add implementaiton.
====================

Link: https://patch.msgid.link/20240808055927.2059700-1-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

bbfeba26

net/mlx5e: CT: Update connection tracking steering entries · 6b5662b7

Cosmin Ratiu authored Aug 08, 2024

Previously, replacing a connection tracking steering entry was done by
adding a new rule (with the same tag but possibly different mod hdr
actions/labels) then removing the old rule.

This approach doesn't work in hardware steering because two steering
entries with the same tag cannot coexist in a hardware steering table.

This commit prepares for that by adding a new ct_rule_update operation on
the ct_fs_ops struct which is used instead of add+delete.
Implementations for both dmfs (firmware steering) and smfs (software
steering) are provided, which simply add the new rule and delete the old
one.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-12-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6b5662b7

net/mlx5e: CT: 'update' rules instead of 'replace' · 486aeb2d

Cosmin Ratiu authored Aug 08, 2024

Offloaded rules can be updated with a new modify header action
containing a changed restore cookie. This was done using the verb
'replace', while in some configurations 'update' is a better fit.

This commit renames the functions used to reflect that.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-11-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

486aeb2d

net/mlx5e: Use extack in get module eeprom by page callback · b5100b72

Gal Pressman authored Aug 08, 2024

In case of errors in get module eeprom by page, reflect it through
extack instead of a dmesg print.
While at it, make the messages more human friendly.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-10-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b5100b72

net/mlx5e: Use extack in set coalesce callback · 9c4298b4

Gal Pressman authored Aug 08, 2024

In case of errors in set coalesce, reflect it through extack instead of
a dmesg print.
While at it, make the messages more human friendly.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-9-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9c4298b4

net/mlx5e: Use extack in get coalesce callback · 29a943d7

Gal Pressman authored Aug 08, 2024

In case of errors in get coalesce, reflect it through extack instead of
a dmesg print.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-8-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

29a943d7

net/mlx5e: Use extack in set ringparams callback · ab666b52

Gal Pressman authored Aug 08, 2024

In case of errors in set ringparams, reflect it through extack instead
of a dmesg print.
While at it, make the messages more human friendly and remove two
redundant checks that are already validated by the core.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-7-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ab666b52

net/mlx5e: Be consistent with bitmap handling of link modes · 4384bcff

Gal Pressman authored Aug 08, 2024

Use the bitmap operations when accessing the advertised/supported link
modes and remove places that access them as arrays of unsigned longs
(underlying implementation of the bitmap), this makes the code much more
readable and clear.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-6-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4384bcff

net/mlx5e: TC, Offload rewrite and mirror to both internal and external dests · b11bde56

Jianbo Liu authored Aug 08, 2024

Firmware has the limitation that it cannot offload a rule with rewrite
and mirror to internal and external destinations simultaneously.

This patch adds a workaround to this issue. Here the destination array
is split again, just like what's done in previous commit, but after
the action indexed by split_count - 1. An extra rule is added for the
leftover destinations. Such rule can be offloaded, even there are
destinations to both internal and external destinations, because the
header rewrite is left in the original FTE.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-5-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b11bde56

net/mlx5e: TC, Offload rewrite and mirror on tunnel over ovs internal port · 16bb8c61

Jianbo Liu authored Aug 08, 2024

To offload the encap rule when the tunnel IP is configured on an
openvswitch internal port, driver need to overwrite vport metadata in
reg_c0 to the value assigned to the internal port, then forward
packets to root table to be processed again by the rules matching on
the metadata for such internal port.

When such rule is combined with header rewrite and mirror, openvswitch
generates the rule like the following, because it resets mirror after
packets are modified.
    in_port(enp8s0f0npf0sf1),..,
        actions:enp8s0f0npf0sf2,set(tunnel(...)),set(ipv4(...)),vxlan_sys_4789,enp8s0f0npf0sf2

The split_count was introduced before to support rewrite and mirror.
Driver splits the rule into two different hardware rules in order to
offload it. But it's not enough to offload the above complicated rule
because of the limitations, in both driver and firmware.

To resolve this issue, the destination array is split again after the
destination indexed by split_count. An extra rule is added for the
leftover destinations (in the above example, it is enp8s0f0npf0sf2),
and is inserted to post_act table. And the extra destination is added
in the original rule to forward to post_act table, so the extra mirror
is done there.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-4-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

16bb8c61

net/mlx5e: Enable remove flow for hard packet limit · 88c46f61

Jianbo Liu authored Aug 08, 2024

In the commit a2a73ea1 ("net/mlx5e: Don't listen to remove flows
event"), remove_flow_enable event is removed, and the hard limit
usually relies on software mechanism added in commit b2f7b01d
("net/mlx5e: Simulate missing IPsec TX limits hardware
functionality"). But the delayed work is rescheduled every one second,
which is slow for fast traffic. As a result, traffic can't be blocked
even reaches the hard limit, which usually happens when soft and hard
limits are very close.

In reality it won't happen because soft limit is much lower than hard
limit. But, as an optimization for RX to block traffic when reaching
hard limit, need to set remove_flow_enable. When remove flow is
enabled, IPSEC HARD_LIFETIME ASO syndrome will be set in the metadata
defined in the ASO return register if packets reach hard lifetime
threshold. And those packets are dropped immediately by the steering
table.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-3-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

88c46f61

net/mlx5: E-Switch, Increase max int port number for offload · 6e20d538

Chris Mi authored Aug 08, 2024

Currently MLX5E_TC_MAX_INT_PORT_NUM is 8. Usually int port has one
ingress and one egress rules. But sometimes, a temporary rule can be
offloaded as well, eg:

recirc_id(0),in_port(br-phy),eth(src=10:70:fd:87:57:c0,dst=33:33:00:00:00:16),
	eth_type(0x86dd),ipv6(frag=no), packets:2, bytes:180, used:0.060s,
	actions:enp8s0f0

If one int port device offloads 3 rules, only 2 devices can offload.
Other devices will hit the limit and fail to offload. Actually it is
insufficient for customers. So increase the number to 32.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-2-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6e20d538

net: ag71xx: use phylink_mii_ioctl · 1862923b

Rosen Penev authored Aug 07, 2024

f1294617 removed the custom function for
ndo_eth_ioctl and used the standard phy_do_ioctl which calls
phy_mii_ioctl. However since then, this driver was ported to phylink
where it makes more sense to call phylink_mii_ioctl.

Bring back custom function that calls phylink_mii_ioctl.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20240807215834.33980-1-rosenp@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1862923b

Merge branch 'ibmvnic-ibmvnic-rr-patchset' · c89c6757

Jakub Kicinski authored Aug 09, 2024

Nick Child says:

====================
ibmvnic: ibmvnic rr patchset

v1 - https://lore.kernel.org/netdev/20240801212340.132607-1-nnac123@linux.ibm.com/
v2 - https://lore.kernel.org/netdev/20240806193706.998148-1-nnac123@linux.ibm.com/
====================

Link: https://patch.msgid.link/20240807211809.1259563-1-nnac123@linux.ibm.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c89c6757

ibmvnic: Perform tx CSO during send scrq direct · e633e32b

Nick Child authored Aug 07, 2024

During initialization with the vnic server, a bitstring is communicated
to the client regarding header info needed during CSO (See "VNIC
Capabilities" in PAPR). Most of the time, to be safe, vnic server
requests header info for CSO. When header info is needed, multiple TX
descriptors are required per skb; This limits the driver to use
send_subcrq_indirect instead of send_subcrq_direct.

Previously, the vnic server request for header info was ignored. This
allowed the use of send_sub_crq_direct. Transmissions were successful
because the bitstring returned by vnic server is broad and over
cautionary. It was observed that mlx backing devices could actually
transmit and handle CSO packets without the vnic server receiving
header info (despite the fact that the bitstring requested it).

There was a trust issue: The bitstring was overcautionary. This extra
precaution (requesting header info when the backing device may not use
it) comes at the cost of performance (using direct vs indirect hcalls
has a 30% delta in small packet RR transaction rate). So it has been
requested that the vnic server team tries to ensure that the bitstring
is more exact. In the meantime, disable CSO when it is possible to use
the skb in the send_subcrq_direct path. In other words, calculate the
checksum before handing the packet to FW when the packet is not
segmented and xmit_more is false.

Since the code path is only possible if the skb is non GSO and xmit_more
is false, the cost of doing checksum in the send_subcrq_direct path is
minimal. Any large segmented skb will have xmit_more set to true more
frequently and it is inexpensive to do checksumming on a small skb.
The worst-case workload would be a 9000 MTU TCP_RR test with close
to MTU sized packets (and TSO off). This allows xmit_more to be false
more frequently and open the code path up to use send_subcrq_direct.
Observing trace data (graph-time = 1) and packet rate with this workload
shows minimal performance degradation:

1. NIC does checksum w headers, safely use send_subcrq_indirect:
- Packet rate: 631k txs
- Trace data:
ibmvnic_xmit = 44344685.87 us / 6234576 hits = AVG 7.11 us
skb_checksum_help = 4.07 us / 2 hits = AVG 2.04 us
^ Notice hits, tracing this just for reassurance
ibmvnic_tx_scrq_flush = 33040649.69 us / 5638441 hits = AVG 5.86 us
send_subcrq_indirect = 37438922.24 us / 6030859 hits = AVG 6.21 us

2. NIC does checksum w/o headers, dangerously use send_subcrq_direct:
- Packet rate: 831k txs
- Trace data:
ibmvnic_xmit = 48940092.29 us / 8187630 hits = AVG 5.98 us
skb_checksum_help = 2.03 us / 1 hits = AVG 2.03
ibmvnic_tx_scrq_flush = 31141879.57 us / 7948960 hits = AVG 3.92 us
send_subcrq_indirect = 8412506.03 us / 728781 hits = AVG 11.54
^ notice hits is much lower b/c send_subcrq_direct was called
^ wasn't traceable

3. driver does checksum, safely use send_subcrq_direct (THIS PATCH):
- Packet rate: 829k txs
- Trace data:
ibmvnic_xmit = 56696077.63 us / 8066168 hits = AVG 7.03 us
skb_checksum_help = 8587456.16 us / 7526072 hits = AVG 1.14 us
ibmvnic_tx_scrq_flush = 30219545.55 us / 7782409 hits = AVG 3.88 us
send_subcrq_indirect = 8638326.44 us / 763693 hits = AVG 11.31 us

When the bitstring ever specifies that CSO does not require headers
(dependent on VIOS vnic server changes), then this patch should be
removed and replaced with one that investigates the bitstring before
using send_subcrq_direct.
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-8-nnac123@linux.ibm.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e633e32b

ibmvnic: Only record tx completed bytes once per handler · 1c33e292

Nick Child authored Aug 07, 2024

Byte Queue Limits depends on dql_completed being called once per tx
completion round in order to adjust its algorithm appropriately. The
dql->limit value is an approximation of the amount of bytes that the NIC
can consume per irq interval. If this approximation is too high then the
NIC will become over-saturated. Too low and the NIC will starve.

The dql->limit depends on dql->prev-* stats to calculate an optimal
value. If dql_completed() is called more than once per irq handler then
those prev-* values become unreliable (because they are not an accurate
representation of the previous state of the NIC) resulting in a
sub-optimal limit value.

Therefore, move the call to netdev_tx_completed_queue() to the end of
ibmvnic_complete_tx().

When performing 150 sessions of TCP rr (request-response 1 byte packets)
workloads, one could observe:
  PREVIOUSLY: - limit and inflight values hovering around 130
              - transaction rate of around 750k pps.

  NOW:        - limit rises and falls in response to inflight (130-900)
              - transaction rate of around 1M pps (33% improvement)
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-7-nnac123@linux.ibm.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1c33e292

ibmvnic: Introduce send sub-crq direct · 74839f7a

Nick Child authored Aug 07, 2024

Firmware supports two hcalls to send a sub-crq request:
H_SEND_SUB_CRQ_INDIRECT and H_SEND_SUB_CRQ. The indirect hcall allows
for submission of batched messages while the other hcall is limited to
only one message. This protocol is defined in PAPR section 17.2.3.3.

Previously, the ibmvnic xmit function only used the indirect hcall. This
allowed the driver to batch it's skbs. A single skb can occupy a few
entries per hcall depending on if FW requires skb header information or
not. The FW only needs header information if the packet is segmented.

By this logic, if an skb is not GSO then it can fit in one sub-crq
message and therefore is a candidate for H_SEND_SUB_CRQ.
Batching skb transmission is only useful when there are more packets
coming down the line (ie netdev_xmit_more is true).

As it turns out, H_SEND_SUB_CRQ induces less latency than
H_SEND_SUB_CRQ_INDIRECT. Therefore, use H_SEND_SUB_CRQ where
appropriate.

Small latency gains seen when doing TCP_RR_150 (request/response
workload). Ftrace results (graph-time=1):
Previous:
ibmvnic_xmit = 29618270.83 us / 8860058.0 hits = AVG 3.34
ibmvnic_tx_scrq_flush = 21972231.02 us / 6553972.0 hits = AVG 3.35
Now:
ibmvnic_xmit = 22153350.96 us / 8438942.0 hits = AVG 2.63
ibmvnic_tx_scrq_flush = 15858922.4 us / 6244076.0 hits = AVG 2.54
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-6-nnac123@linux.ibm.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

74839f7a

ibmvnic: Remove duplicate memory barriers in tx · 6e7a5758

Nick Child authored Aug 07, 2024

send_subcrq_[in]direct() already has a dma memory barrier.
Remove the earlier one.
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-5-nnac123@linux.ibm.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6e7a5758

ibmvnic: Reduce memcpys in tx descriptor generation · d95f749a

Nick Child authored Aug 07, 2024

Previously when creating the header descriptors, the driver would:
1. allocate a temporary buffer on the stack (in build_hdr_descs_arr)
2. memcpy the header info into the temporary buffer (in build_hdr_data)
3. memcpy the temp buffer into a local variable (in create_hdr_descs)
4. copy the local variable into the return buffer (in create_hdr_descs)

Since, there is no opportunity for errors during this process, the temp
buffer is not needed and work can be done on the return buffer directly.

Repurpose build_hdr_data() to only calculate the header lengths. Rename
it to get_hdr_lens().
Edit create_hdr_descs() to read from the skb directly and copy directly
into the returned useful buffer.

The process now involves less memory and write operations while
also being more readable.
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-4-nnac123@linux.ibm.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d95f749a

ibmvnic: Use header len helper functions on tx · b41b45ec

Nick Child authored Aug 07, 2024

Use the header length helper functions rather than trying to calculate
it within the driver. There are defined functions for mac and network
headers (skb_mac_header_len and skb_network_header_len) but no such
function exists for the transport header length.

Also, hdr_data was memset during allocation to all 0's so no need to
memset again.
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-3-nnac123@linux.ibm.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b41b45ec

ibmvnic: Only replenish rx pool when resources are getting low · dda10fc8

Nick Child authored Aug 07, 2024

Previously, the driver would replenish the rx pool if the polling
function consumed less than the budget. The logic being that the driver
did not exhaust its budget so that must mean that the driver is not busy
and has cycles to spare for replenishing the pool.

So pool replenishment happens on every poll which did not consume
the budget. This can very costly during request-response tests.

In fact, an extra ~100pps can be seen in TCP_RR_150 tests when we remove
this conditional. Trace results (ftrace, graph-time=1) for the poll
function are below:
Previous results:
    ibmvnic_poll = 64951846.0 us / 4167628.0 hits = AVG 15.58
    replenish_rx_pool = 17602846.0 us / 4710437.0 hits = AVG 3.74
Now:
    ibmvnic_poll = 57673941.0 us / 4791737.0 hits = AVG 12.04
    replenish_rx_pool = 3938171.6 us / 4314.0 hits = AVG 912.88

While the replenish function takes longer, it is hit less frequently
meaning the ibmvnic_poll function, on average, is faster.

Furthermore, this change does not have a negative effect on
performance bandwidth/latency measurements.
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-2-nnac123@linux.ibm.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

dda10fc8