Commits · 0b31fb9ba2b54896b0fe4b1ec0102031c40bdf4f · Kirill Smelkov / linux

10 Nov, 2022 6 commits

mlxsw: reg: Add Switch Port FDB Security Register · 0b31fb9b

Ido Schimmel authored Nov 08, 2022

Add the Switch Port FDB Security Register (SPFSR) that allows enabling
and disabling security checks on a given local port. In Linux terms, it
allows locking / unlocking a port.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0b31fb9b

mlxsw: spectrum_trap: Register 802.1X packet traps with devlink · d85be0f5

Ido Schimmel authored Nov 08, 2022

Register the previously added packet traps with devlink. This allows
user space to tune their policers and in the case of the locked port
trap, user space can set its action to "trap" in order to gain
visibility into packets that were discarded by the device due to the
locked port check failure.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d85be0f5

devlink: Add packet traps for 802.1X operation · 2640a82b

Ido Schimmel authored Nov 08, 2022

Add packet traps for 802.1X operation. The "eapol" control trap is used
to trap EAPOL packets and is required for the correct operation of the
control plane. The "locked_port" drop trap can be enabled to gain
visibility into packets that were dropped by the device due to the
locked bridge port check.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2640a82b

bridge: switchdev: Reflect MAB bridge port flag to device drivers · 9c0ca02b

Ido Schimmel authored Nov 08, 2022

Reflect the 'BR_PORT_MAB' flag to device drivers so that:

* Drivers that support MAB could act upon the flag being toggled.
* Drivers that do not support MAB will prevent MAB from being enabled.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9c0ca02b

bridge: switchdev: Allow device drivers to install locked FDB entries · 27fabd02

Hans J. Schultz authored Nov 08, 2022

When the bridge is offloaded to hardware, FDB entries are learned and
aged-out by the hardware. Some device drivers synchronize the hardware
and software FDBs by generating switchdev events towards the bridge.

When a port is locked, the hardware must not learn autonomously, as
otherwise any host will blindly gain authorization. Instead, the
hardware should generate events regarding hosts that are trying to gain
authorization and their MAC addresses should be notified by the device
driver as locked FDB entries towards the bridge driver.

Allow device drivers to notify the bridge driver about such entries by
extending the 'switchdev_notifier_fdb_info' structure with the 'locked'
bit. The bit can only be set by device drivers and not by the bridge
driver.

Prevent a locked entry from being installed if MAB is not enabled on the
bridge port.

If an entry already exists in the bridge driver, reject the locked entry
if the current entry does not have the "locked" flag set or if it points
to a different port. The same semantics are implemented in the software
data path.
Signed-off-by: Hans J. Schultz <netdev@kapio-technology.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

27fabd02

bridge: switchdev: Let device drivers determine FDB offload indication · 9baedc3c

Ido Schimmel authored Nov 08, 2022

Currently, FDB entries that are notified to the bridge via
'SWITCHDEV_FDB_ADD_TO_BRIDGE' are always marked as offloaded. With MAB
enabled, this will no longer be universally true. Device drivers will
report locked FDB entries to the bridge to let it know that the
corresponding hosts required authorization, but it does not mean that
these entries are necessarily programmed in the underlying hardware.

Solve this by determining the offload indication based of the
'offloaded' bit in the FDB notification.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9baedc3c

09 Nov, 2022 13 commits

Merge branch 'net-devlink-move-netdev-notifier-block-to-dest-namespace-during-reload' · bf9b8556

Jakub Kicinski authored Nov 09, 2022

Jiri Pirko says:

====================
net: devlink: move netdev notifier block to dest namespace during reload

Patch #1 is just a dependency of patch #2, which is the actual fix.
====================

Link: https://lore.kernel.org/r/20221108132208.938676-1-jiri@resnulli.usSigned-off-by: Jakub Kicinski <kuba@kernel.org>

bf9b8556

net: devlink: move netdev notifier block to dest namespace during reload · 15feb56e

Jiri Pirko authored Nov 08, 2022

The notifier block tracking netdev changes in devlink is registered
during devlink_alloc() per-net, it is then unregistered
in devlink_free(). When devlink moves from net namespace to another one,
the notifier block needs to move along.

Fix this by adding forgotten call to move the block.
Reported-by: Ido Schimmel <idosch@idosch.org>
Fixes: 02a68a47 ("net: devlink: track netdev with devlink_port assigned")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

15feb56e

net: introduce a helper to move notifier block to different namespace · 3e52fba0

Jiri Pirko authored Nov 08, 2022

Currently, net_dev() netdev notifier variant follows the netdev with
per-net notifier from namespace to namespace. This is implemented
by move_netdevice_notifiers_dev_net() helper.

For devlink it is needed to re-register per-net notifier during
devlink reload. Introduce a new helper called
move_netdevice_notifier_net() and share the unregister/register code
with existing move_netdevice_notifiers_dev_net() helper.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3e52fba0

genetlink: correctly begin the iteration over policies · 154ba79c

Jakub Kicinski authored Nov 08, 2022

The return value from genl_op_iter_init() only tells us if
there are any policies but to begin the iteration (and therefore
load the first entry) we need to call genl_op_iter_next().
Note that it's safe to call genl_op_iter_next() on a family
with no ops, it will just return false.

This may lead to various crashes, a warning in
netlink_policy_dump_get_policy_idx() when policy is not found
or.. no problem at all if the kmalloc'ed memory happens to be
zeroed.

Fixes: b502b318 ("genetlink: use iterator in the op to policy map dumping")
Link: https://lore.kernel.org/r/20221108204128.330287-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

154ba79c

Merge tag 'rxrpc-next-20221108' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 3ca6c3b4

David S. Miller authored Nov 09, 2022

rxrpc changes

David Howells says:

====================
rxrpc: Increasing SACK size and moving away from softirq, part 1

AF_RXRPC has some issues that need addressing:

 (1) The SACK table has a maximum capacity of 255, but for modern networks
     that isn't sufficient.  This is hard to increase in the upstream code
     because of the way the application thread is coupled to the softirq
     and retransmission side through a ring buffer.  Adjustments to the rx
     protocol allows a capacity of up to 8192, and having a ring
     sufficiently large to accommodate that would use an excessive amount
     of memory as this is per-call.

 (2) Processing ACKs in softirq mode causes the ACKs get conflated, with
     only the most recent being considered.  Whilst this has the upside
     that the retransmission algorithm only needs to deal with the most
     recent ACK, it causes DATA transmission for a call to be very bursty
     because DATA packets cannot be transmitted in softirq mode.  Rather
     transmission must be delegated to either the application thread or a
     workqueue, so there tend to be sudden bursts of traffic for any
     particular call due to scheduling delays.

 (3) All crypto in a single call is done in series; however, each DATA
     packet is individually encrypted so encryption and decryption of large
     calls could be parallelised if spare CPU resources are available.

This is the first of a number of sets of patches that try and address them.
The overall aims of these changes include:

 (1) To get rid of the TxRx ring and instead pass the packets round in
     queues (eg. sk_buff_head).  On the Tx side, each ACK packet comes with
     a SACK table that can be parsed as-is, so there's no particular need
     to maintain our own; we just have to refer to the ACK.

     On the Rx side, we do need to maintain a SACK table with one bit per
     entry - but only if packets go missing - and we don't want to have to
     perform a complex transformation to get the information into an ACK
     packet.

 (2) To try and move almost all processing of received packets out of the
     softirq handler and into a high-priority kernel I/O thread.  Only the
     transferral of packets would be left there.  I would still use the
     encap_rcv hook to receive packets as there's a noticeable performance
     drop from letting the UDP socket put the packets into its own queue
     and then getting them out of there.

 (3) To make the I/O thread also do all the transmission.  The app thread
     would be responsible for packaging the data into packets and then
     buffering them for the I/O thread to transmit.  This would make it
     easier for the app thread to run ahead of the I/O thread, and would
     mean the I/O thread is less likely to have to wait around for a new
     packet to come available for transmission.

 (4) To logically partition the socket/UAPI/KAPI side of things from the
     I/O side of things.  The local endpoint, connection, peer and call
     objects would belong to the I/O side.  The socket side would not then
     touch the private internals of calls and suchlike and would not change
     their states.  It would only look at the send queue, receive queue and
     a way to pass a message to cause an abort.

 (5) To remove as much locking, synchronisation, barriering and atomic ops
     as possible from the I/O side.  Exclusion would be achieved by
     limiting modification of state to the I/O thread only.  Locks would
     still need to be used in communication with the UDP socket and the
     AF_RXRPC socket API.

 (6) To provide crypto offload kernel threads that, when there's slack in
     the system, can see packets that need crypting and provide
     parallelisation in dealing with them.

 (7) To remove the use of system timers.  Since each timer would then send
     a poke to the I/O thread, which would then deal with it when it had
     the opportunity, there seems no point in using system timers if,
     instead, a list of timeouts can be sensibly consulted.  An I/O thread
     only then needs to schedule with a timeout when it is idle.

 (8) To use zero-copy sendmsg to send packets.  This would make use of the
     I/O thread being the sole transmitter on the socket to manage the
     dead-reckoning sequencing of the completion notifications.  There is a
     problem with zero-copy, though: the UDP socket doesn't handle running
     out of option memory very gracefully.

With regard to this first patchset, the changes made include:

 (1) Some fixes, including a fallback for proc_create_net_single_write(),
     setting ack.bufferSize to 0 in ACK packets and a fix for rxrpc
     congestion management, which shouldn't be saving the cwnd value
     between calls.

 (2) Improvements in rxrpc tracepoints, including splitting the timer
     tracepoint into a set-timer and a timer-expired trace.

 (3) Addition of a new proc file to display some stats.

 (4) Some code cleanups, including removing some unused bits and
     unnecessary header inclusions.

 (5) A change to the recently added UDP encap_err_rcv hook so that it has
     the same signature as {ip,ipv6}_icmp_error(), and then just have rxrpc
     point its UDP socket's hook directly at those.

 (6) Definition of a new struct, rxrpc_txbuf, that is used to hold
     transmissible packets of DATA and ACK type in a single 2KiB block
     rather than using an sk_buff.  This allows the buffer to be on a
     number of queues simultaneously more easily, and also guarantees that
     the entire block is in a single unit for zerocopy purposes and that
     the data payload is aligned for in-place crypto purposes.

 (7) ACK txbufs are allocated at proposal and queued for later transmission
     rather than being stored in a single place in the rxrpc_call struct,
     which means only a single ACK can be pending transmission at a time.
     The queue is then drained at various points.  This allows the ACK
     generation code to be simplified.

 (8) The Rx ring buffer is removed.  When a jumbo packet is received (which
     comprises a number of ordinary DATA packets glued together), it used
     to be pointed to by the ring multiple times, with an annotation in a
     side ring indicating which subpacket was in that slot - but this is no
     longer possible.  Instead, the packet is cloned once for each
     subpacket, barring the last, and the range of data is set in the skb
     private area.  This makes it easier for the subpackets in a jumbo
     packet to be decrypted in parallel.

 (9) The Tx ring buffer is removed.  The side annotation ring that held the
     SACK information is also removed.  Instead, in the event of packet
     loss, the SACK data attached an ACK packet is parsed.

(10) Allocate an skcipher request when needed in the rxkad security class
     rather than caching one in the rxrpc_call struct.  This deals with a
     race between externally-driven call disconnection getting rid of the
     skcipher request and sendmsg/recvmsg trying to use it because they
     haven't seen the completion yet.  This is also needed to support
     parallelisation as the skcipher request cannot be used by two or more
     threads simultaneously.

(11) Call udp_sendmsg() and udpv6_sendmsg() directly rather than going
     through kernel_sendmsg() so that we can provide our own iterator
     (zerocopy explicitly doesn't work with a KVEC iterator).  This also
     lets us avoid the overhead of the security hook.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3ca6c3b4

net/core: Allow live renaming when an interface is up · bd039b5e

Andy Ren authored Nov 07, 2022

Allow a network interface to be renamed when the interface
is up.

As described in the netconsole documentation [1], when netconsole is
used as a built-in, it will bring up the specified interface as soon as
possible. As a result, user space will not be able to rename the
interface since the kernel disallows renaming of interfaces that are
administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set
by the kernel.

The original solution [2] to this problem was to add a new parameter to
the netconsole configuration parameters that allows renaming of
the interface used by netconsole while it is administratively up.
However, during the discussion that followed, it became apparent that we
have no reason to keep the current restriction and instead we should
allow user space to rename interfaces regardless of their administrative
state:

1. The restriction was put in place over 20 years ago when renaming was
only possible via IOCTL and before rtnetlink started notifying user
space about such changes like it does today.

2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version
5.2 and no regressions were reported.

3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about
the administrative state of interface.

Therefore, allow user space to rename running interfaces by removing the
restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in
possible triage by emitting a message to the kernel log that an
interface was renamed while UP.

[1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst
[2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/Signed-off-by: Andy Ren <andy.ren@getcruise.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

bd039b5e

Merge branch 'dsa-microchip-checking' · b96c7b4c

David S. Miller authored Nov 09, 2022

Rakesh Sankaranarayanan says:

====================
net: dsa: microchip: ksz_pwrite status check for lan937x and irq and error checking updates for ksz series

This patch series include following changes,
- Add KSZ9563 inside ksz_switch_chips. As per current structure,
KSZ9893 is reused inside ksz_switch_chips structure, but since
there is a mismatch in number of irq's, new member added for KSZ9563
and sku detected based on Global Chip ID 4 Register. Compatible
string from device tree mapped to KSZ9563 for spi and i2c mode
probes.
- Assign device interrupt during i2c probe operation.
- Add error checking for ksz_pwrite inside lan937x_change_mtu. After v6.0,
ksz_pwrite updated to have return type int instead of void, and
lan937x_change_mtu still uses ksz_pwrite without status verification.
- Add port_nirq as 3 for KSZ8563 switch family.
- Use dev_err_probe() instead of dev_err() to have more standardized error
formatting and logging.

v1 -> v2:
- Removed regmap validation patch from the series, planning to take
  up in future after checking for any better approach and studying
  the actual need for this change.
- Resolved error reported in ksz8863_smi.c file.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

b96c7b4c

net: dsa: microchip: add dev_err_probe in probe functions · 9b183317

Rakesh Sankaranarayanan authored Nov 07, 2022

Probe functions uses normal dev_err() to check error conditions
and print messages. Replace dev_err() with dev_err_probe() to
have more standardized format and error logging.
Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9b183317

net: dsa: microchip: ksz8563: Add number of port irq · 4630d142

Rakesh Sankaranarayanan authored Nov 07, 2022

KSZ8563 have three port interrupts: PTP, PHY and ACL. Add
port_nirq as 3 for KSZ8563 inside ksz_chip_data.
Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4630d142

net: dsa: microchip: add error checking for ksz_pwrite · e06999c3

Rakesh Sankaranarayanan authored Nov 07, 2022

Add status validation for port register write inside
lan937x_change_mtu. ksz_pwrite and ksz_pread api's are
updated with return type int (Reference patch mentioned
below). Update lan937x_change_mtu with status validation
for ksz_pwrite16().

Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220826105634.3855578-6-o.rempel@pengutronix.de/Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e06999c3

net: dsa: microchip: add irq in i2c probe · a9c6db3b

Rakesh Sankaranarayanan authored Nov 07, 2022

add device irq in i2c probe function.
Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a9c6db3b

net: dsa: microchip: add ksz9563 in ksz_switch_ops and select based on compatible string · ef912fe4

Rakesh Sankaranarayanan authored Nov 07, 2022

Add KSZ9563 inside ksz_switch_chips structure with
port_nirq as 3. KSZ9563 use KSZ9893 switch parameters
but port_nirq count is 3 for KSZ9563 whereas 2 for
KSZ9893. Add KSZ9563 inside ksz_switch_chips as a separate
member and from device tree map compatible string into
KSZ9563 inside ksz_spi.c and ksz9477_i2c.c.
Global Chip ID 1 and 2 registers read value 9893, select
sku based on  Global Chip ID 4 Register which read 0x1c
for KSZ9563.
Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ef912fe4

net: ethernet: renesas: rswitch: Fix endless loop in error paths · 380f9acd

Yoshihiro Shimoda authored Nov 07, 2022

Coverity reported that the error path in rswitch_gwca_queue_alloc_skb()
has an issue to cause endless loop. So, fix the issue by changing
variables' types from u32 to int. After changed the types,
rswitch_tx_free() should use rswitch_get_num_cur_queues() to
calculate number of current queues.
Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1527147 ("Control flow issues")
Fixes: 3590918b ("net: ethernet: renesas: Add support for "Ethernet Switch"")
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Link: https://lore.kernel.org/r/20221107081021.2955122-1-yoshihiro.shimoda.uh@renesas.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

380f9acd

08 Nov, 2022 21 commits

lib: Fix some kernel-doc comments · 8e18be76

Yang Li authored Nov 07, 2022

Make the description of @policy to @p in nla_policy_len()
to clear the below warnings:

lib/nlattr.c:660: warning: Function parameter or member 'p' not described in 'nla_policy_len'
lib/nlattr.c:660: warning: Excess function parameter 'policy' description in 'nla_policy_len'

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2736Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221107062623.6709-1-yang.lee@linux.alibaba.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8e18be76

rxrpc: Allocate an skcipher each time needed rather than reusing · 30d95efe

David Howells authored Nov 04, 2022

In the rxkad security class, allocate the skcipher used to do packet
encryption and decription rather than allocating one up front and reusing
it for each packet.  Reusing the skcipher precludes doing crypto in
parallel.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

30d95efe

rxrpc: Fix congestion management · 1fc4fa2a

David Howells authored Oct 03, 2022

rxrpc has a problem in its congestion management in that it saves the
congestion window size (cwnd) from one call to another, but if this is 0 at
the time is saved, then the next call may not actually manage to ever
transmit anything.

To this end:

 (1) Don't save cwnd between calls, but rather reset back down to the
     initial cwnd and re-enter slow-start if data transmission is idle for
     more than an RTT.

 (2) Preserve ssthresh instead, as that is a handy estimate of pipe
     capacity.  Knowing roughly when to stop slow start and enter
     congestion avoidance can reduce the tendency to overshoot and drop
     larger amounts of packets when probing.

In future, cwind growth also needs to be constrained when the window isn't
being filled due to being application limited.
Reported-by: Simon Wilkinson <sxw@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

1fc4fa2a

rxrpc: Remove the rxtx ring · 6869ddb8

David Howells authored Jun 15, 2022

The Rx/Tx ring is no longer used, so remove it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

6869ddb8

rxrpc: Save last ACK's SACK table rather than marking txbufs · d57a3a15

David Howells authored May 07, 2022

Improve the tracking of which packets need to be transmitted by saving the
last ACK packet that we receive that has a populated soft-ACK table rather
than marking packets.  Then we can step through the soft-ACK table and look
at the packets we've transmitted beyond that to determine which packets we
might want to retransmit.

We also look at the highest serial number that has been acked to try and
guess which packets we've transmitted the peer is likely to have seen.  If
necessary, we send a ping to retrieve that number.

One downside that might be a problem is that we can't then compare the
previous acked/unacked state so easily in rxrpc_input_soft_acks() - which
is a potential problem for the slow-start algorithm.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

d57a3a15

rxrpc: Remove call->lock · 4e76bd40

David Howells authored May 06, 2022

call->lock is no longer necessary, so remove it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

4e76bd40

rxrpc: Don't use a ring buffer for call Tx queue · a4ea4c47

David Howells authored Mar 31, 2022

Change the way the Tx queueing works to make the following ends easier to
achieve:

 (1) The filling of packets, the encryption of packets and the transmission
     of packets can be handled in parallel by separate threads, rather than
     rxrpc_sendmsg() allocating, filling, encrypting and transmitting each
     packet before moving onto the next one.

 (2) Get rid of the fixed-size ring which sets a hard limit on the number
     of packets that can be retained in the ring.  This allows the number
     of packets to increase without having to allocate a very large ring or
     having variable-sized rings.

     [Note: the downside of this is that it's then less efficient to locate
     a packet for retransmission as we then have to step through a list and
     examine each buffer in the list.]

 (3) Allow the filler/encrypter to run ahead of the transmission window.

 (4) Make it easier to do zero copy UDP from the packet buffers.

 (5) Make it easier to do zero copy from userspace to the packet buffers -
     and thence to UDP (only if for unauthenticated connections).

To that end, the following changes are made:

 (1) Use the new rxrpc_txbuf struct instead of sk_buff for keeping packets
     to be transmitted in.  This allows them to be placed on multiple
     queues simultaneously.  An sk_buff isn't really necessary as it's
     never passed on to lower-level networking code.

 (2) Keep the transmissable packets in a linked list on the call struct
     rather than in a ring.  As a consequence, the annotation buffer isn't
     used either; rather a flag is set on the packet to indicate ackedness.

 (3) Use the RXRPC_CALL_TX_LAST flag to indicate that the last packet to be
     transmitted has been queued.  Add RXRPC_CALL_TX_ALL_ACKED to indicate
     that all packets up to and including the last got hard acked.

 (4) Wire headers are now stored in the txbuf rather than being concocted
     on the stack and they're stored immediately before the data, thereby
     allowing zerocopy of a single span.

 (5) Don't bother with instant-resend on transmission failure; rather,
     leave it for a timer or an ACK packet to trigger.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

a4ea4c47

rxrpc: Get rid of the Rx ring · 5d7edbc9

David Howells authored Aug 27, 2022

Get rid of the Rx ring and replace it with a pair of queues instead.  One
queue gets the packets that are in-sequence and are ready for processing by
recvmsg(); the other queue gets the out-of-sequence packets for addition to
the first queue as the holes get filled.

The annotation ring is removed and replaced with a SACK table.  The SACK
table has the bits set that correspond exactly to the sequence number of
the packet being acked.  The SACK ring is copied when an ACK packet is
being assembled and rotated so that the first ACK is in byte 0.

Flow control handling is altered so that packets that are moved to the
in-sequence queue are hard-ACK'd even before they're consumed - and then
the Rx window size in the ACK packet (rsize) is shrunk down to compensate
(even going to 0 if the window is full).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

5d7edbc9

rxrpc: Clone received jumbo subpackets and queue separately · d4d02d8b

David Howells authored Oct 07, 2022

Split up received jumbo packets into separate skbuffs by cloning the
original skbuff for each subpacket and setting the offset and length of the
data in that subpacket in the skbuff's private data.  The subpackets are
then placed on the recvmsg queue separately.  The security class then gets
to revise the offset and length to remove its metadata.

If we fail to clone a packet, we just drop it and let the peer resend it.
The original packet gets used for the final subpacket.

This should make it easier to handle parallel decryption of the subpackets.
It also simplifies the handling of lost or misordered packets in the
queuing/buffering loop as the possibility of overlapping jumbo packets no
longer needs to be considered.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

d4d02d8b

rxrpc: Split the rxrpc_recvmsg tracepoint · faf92e8d

David Howells authored Oct 07, 2022

Split the rxrpc_recvmsg tracepoint so that the tracepoints that are about
data packet processing (and which have extra pieces of information) are
separate from the tracepoint that shows the general flow of recvmsg().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

faf92e8d

rxrpc: Clean up ACK handling · 530403d9

David Howells authored Jan 30, 2020

Clean up the rxrpc_propose_ACK() function.  If deferred PING ACK proposal
is split out, it's only really needed for deferred DELAY ACKs.  All other
ACKs, bar terminal IDLE ACK are sent immediately.  The deferred IDLE ACK
submission can be handled by conversion of a DELAY ACK into an IDLE ACK if
there's nothing to be SACK'd.

Also, because there's a delay between an ACK being generated and being
transmitted, it's possible that other ACKs of the same type will be
generated during that interval.  Apart from the ACK time and the serial
number responded to, most of the ACK body, including window and SACK
parameters, are not filled out till the point of transmission - so we can
avoid generating a new ACK if there's one pending that will cover the SACK
data we need to convey.

Therefore, don't propose a new DELAY or IDLE ACK for a call if there's one
already pending.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

530403d9

rxrpc: Allocate ACK records at proposal and queue for transmission · 72f0c6fb

David Howells authored Jan 30, 2020

Allocate rxrpc_txbuf records for ACKs and put onto a queue for the
transmitter thread to dispatch.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

72f0c6fb

rxrpc: Define rxrpc_txbuf struct to carry data to be transmitted · 02a19356

David Howells authored Apr 05, 2022

Define a struct, rxrpc_txbuf, to carry data to be transmitted instead of a
socket buffer so that it can be placed onto multiple queues at once.  This
also allows the data buffer to be in the same allocation as the internal
data.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

02a19356

rxrpc: Remove call->tx_phase · a11e6ff9

David Howells authored Oct 07, 2022

Remove call->tx_phase as it's only ever set.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

a11e6ff9

rxrpc: Remove the flags from the rxrpc_skb tracepoint · 27f699cc

David Howells authored Oct 07, 2022

Remove the flags from the rxrpc_skb tracepoint as we're no longer going to
be using this for the transmission buffers and so marking which are
transmission buffers isn't going to be necessary.

Note that this also remove the rxrpc skb flag that indicates if this is a
transmission buffer and so the count is not updated for the moment.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

27f699cc

rxrpc: Remove unnecessary header inclusions · 23b237f3

David Howells authored Jan 23, 2020

Remove a bunch of unnecessary header inclusions.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

23b237f3

rxrpc: Call udp_sendmsg() directly · ed472b0c

David Howells authored Mar 22, 2022

Call udp_sendmsg() and udpv6_sendmsg() directly rather than calling
kernel_sendmsg() as the latter assumes we want a kvec-class iterator.
However, zerocopy explicitly doesn't work with such an iterator.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

ed472b0c

rxrpc: Use the core ICMP/ICMP6 parsers · b6c66c43

David Howells authored Oct 12, 2022

Make rxrpc_encap_rcv_err() pass the ICMP/ICMP6 skbuff to ip_icmp_error() or
ipv6_icmp_error() as appropriate to do the parsing rather than trying to do
it in rxrpc.

This pushes an error report onto the UDP socket's error queue and calls
->sk_error_report() from which point rxrpc can pick it up.

It would be preferable to steal the packet directly from ip*_icmp_error()
rather than letting it get queued, but this is probably good enough.

Also note that __udp4_lib_err() calls sk_error_report() twice in some
cases.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

b6c66c43

net: Change the udp encap_err_rcv to allow use of {ip,ipv6}_icmp_error() · 42fb06b3

David Howells authored Oct 12, 2022

Change the udp encap_err_rcv signature to match ip_icmp_error() and
ipv6_icmp_error() so that those can be used from the called function and
export them.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org

42fb06b3

rxrpc: Fix ack.bufferSize to be 0 when generating an ack · 8889a711

David Howells authored Sep 07, 2022

ack.bufferSize should be set to 0 when generating an ack.

Fixes: 8d94aa38 ("rxrpc: Calls shouldn't hold socket refs")
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

8889a711

rxrpc: Record stats for why the REQUEST-ACK flag is being set · f7fa5242

David Howells authored Aug 18, 2022

Record stats for why the REQUEST-ACK flag is being set.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org

f7fa5242