1. 09 Nov, 2022 11 commits
    • net: introduce a helper to move notifier block to different namespace · 3e52fba0
      Jiri Pirko authored
      Currently, the net_dev() variant of the netdev notifier follows the
      netdev, together with its per-net notifier, from namespace to namespace.
      This is implemented by the move_netdevice_notifiers_dev_net() helper.
      
      Devlink needs to re-register its per-net notifier during devlink
      reload. Introduce a new helper called move_netdevice_notifier_net()
      and share the unregister/register code with the existing
      move_netdevice_notifiers_dev_net() helper.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
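The shared unregister/register step described in this commit can be sketched in user-space C. This is only an illustration under stated assumptions: `struct netns`, `struct notifier`, `ns_register()`, `ns_unregister()`, `ns_has()` and `move_notifier()` are hypothetical stand-ins, not the kernel's types or helpers, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a namespace and a notifier block. */
struct notifier {
	struct notifier *next;
};

struct netns {
	struct notifier *chain;	/* singly linked list of registered notifiers */
};

/* Add a notifier to the head of a namespace's chain. */
static void ns_register(struct netns *ns, struct notifier *nb)
{
	nb->next = ns->chain;
	ns->chain = nb;
}

/* Remove a notifier from a namespace's chain if it is present. */
static void ns_unregister(struct netns *ns, struct notifier *nb)
{
	struct notifier **pp;

	for (pp = &ns->chain; *pp; pp = &(*pp)->next) {
		if (*pp == nb) {
			*pp = nb->next;
			nb->next = NULL;
			return;
		}
	}
}

/* Report whether a notifier is currently registered in a namespace. */
static bool ns_has(const struct netns *ns, const struct notifier *nb)
{
	const struct notifier *p;

	for (p = ns->chain; p; p = p->next)
		if (p == nb)
			return true;
	return false;
}

/* The shared step: unregister from the old namespace, register in the new. */
static void move_notifier(struct netns *src, struct netns *dst,
			  struct notifier *nb)
{
	ns_unregister(src, nb);
	ns_register(dst, nb);
}
```

A caller that moves a whole device and a caller that only re-registers one notifier could both funnel through `move_notifier()`, which is the kind of sharing the commit message describes.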
    • genetlink: correctly begin the iteration over policies · 154ba79c
      Jakub Kicinski authored
      The return value from genl_op_iter_init() only tells us if
      there are any policies but to begin the iteration (and therefore
      load the first entry) we need to call genl_op_iter_next().
      Note that it's safe to call genl_op_iter_next() on a family
      with no ops, it will just return false.
      
      Skipping that call may lead to various crashes, to a warning in
      netlink_policy_dump_get_policy_idx() when the policy is not found,
      or to no problem at all if the kmalloc'ed memory happens to be
      zeroed.
      
      Fixes: b502b318 ("genetlink: use iterator in the op to policy map dumping")
      Link: https://lore.kernel.org/r/20221108204128.330287-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
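The init/next distinction in this fix can be illustrated with a small user-space iterator. This is a sketch, not the genetlink code: `struct op_iter`, `op_iter_init()`, `op_iter_next()` and `sum_ops()` are hypothetical names, and the "ops" are plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical iterator over a table of op ids. */
struct op_iter {
	const int *ops;
	size_t n;
	size_t pos;	/* index of the next entry to load */
	int cur;	/* only valid after op_iter_next() returned true */
};

/* Prepare the iterator; reports only whether there is anything to walk. */
static bool op_iter_init(struct op_iter *it, const int *ops, size_t n)
{
	it->ops = ops;
	it->n = n;
	it->pos = 0;
	it->cur = -1;	/* nothing loaded yet */
	return n > 0;
}

/* Load the next entry; safe on an empty table, where it returns false. */
static bool op_iter_next(struct op_iter *it)
{
	if (it->pos >= it->n)
		return false;
	it->cur = it->ops[it->pos++];
	return true;
}

/* Correct usage: init, then call next before touching it->cur. */
static int sum_ops(const int *ops, size_t n)
{
	struct op_iter it;
	int sum = 0;

	op_iter_init(&it, ops, n);
	while (op_iter_next(&it))
		sum += it.cur;
	return sum;
}
```

Reading `it.cur` straight after `op_iter_init()` would see the placeholder rather than the first entry, which mirrors the class of bug the commit describes.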
    • Merge tag 'rxrpc-next-20221108' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 3ca6c3b4
      David S. Miller authored
      rxrpc changes
      
      David Howells says:
      
      ====================
      rxrpc: Increasing SACK size and moving away from softirq, part 1
      
      AF_RXRPC has some issues that need addressing:
      
       (1) The SACK table has a maximum capacity of 255, but for modern networks
           that isn't sufficient.  This is hard to increase in the upstream code
           because of the way the application thread is coupled to the softirq
           and retransmission side through a ring buffer.  Adjustments to the rx
           protocol allow a capacity of up to 8192, and having a ring
           sufficiently large to accommodate that would use an excessive amount
           of memory as this is per-call.
      
       (2) Processing ACKs in softirq mode causes the ACKs to get conflated, with
           only the most recent being considered.  Whilst this has the upside
           that the retransmission algorithm only needs to deal with the most
           recent ACK, it causes DATA transmission for a call to be very bursty
           because DATA packets cannot be transmitted in softirq mode.  Rather
           transmission must be delegated to either the application thread or a
           workqueue, so there tend to be sudden bursts of traffic for any
           particular call due to scheduling delays.
      
       (3) All crypto in a single call is done in series; however, each DATA
           packet is individually encrypted so encryption and decryption of large
           calls could be parallelised if spare CPU resources are available.
      
      This is the first of a number of sets of patches that try and address them.
      The overall aims of these changes include:
      
       (1) To get rid of the TxRx ring and instead pass the packets round in
           queues (eg. sk_buff_head).  On the Tx side, each ACK packet comes with
           a SACK table that can be parsed as-is, so there's no particular need
           to maintain our own; we just have to refer to the ACK.
      
           On the Rx side, we do need to maintain a SACK table with one bit per
           entry - but only if packets go missing - and we don't want to have to
           perform a complex transformation to get the information into an ACK
           packet.
      
       (2) To try and move almost all processing of received packets out of the
           softirq handler and into a high-priority kernel I/O thread.  Only the
           transferral of packets would be left there.  I would still use the
           encap_rcv hook to receive packets as there's a noticeable performance
           drop from letting the UDP socket put the packets into its own queue
           and then getting them out of there.
      
       (3) To make the I/O thread also do all the transmission.  The app thread
           would be responsible for packaging the data into packets and then
           buffering them for the I/O thread to transmit.  This would make it
           easier for the app thread to run ahead of the I/O thread, and would
           mean the I/O thread is less likely to have to wait around for a new
           packet to come available for transmission.
      
       (4) To logically partition the socket/UAPI/KAPI side of things from the
           I/O side of things.  The local endpoint, connection, peer and call
           objects would belong to the I/O side.  The socket side would not then
           touch the private internals of calls and suchlike and would not change
           their states.  It would only look at the send queue, receive queue and
           a way to pass a message to cause an abort.
      
       (5) To remove as much locking, synchronisation, barriering and atomic ops
           as possible from the I/O side.  Exclusion would be achieved by
           limiting modification of state to the I/O thread only.  Locks would
           still need to be used in communication with the UDP socket and the
           AF_RXRPC socket API.
      
       (6) To provide crypto offload kernel threads that, when there's slack in
           the system, can see packets that need crypting and provide
           parallelisation in dealing with them.
      
       (7) To remove the use of system timers.  Since each timer would then send
           a poke to the I/O thread, which would then deal with it when it had
           the opportunity, there seems no point in using system timers if,
           instead, a list of timeouts can be sensibly consulted.  An I/O thread
           only then needs to schedule with a timeout when it is idle.
      
       (8) To use zero-copy sendmsg to send packets.  This would make use of the
           I/O thread being the sole transmitter on the socket to manage the
           dead-reckoning sequencing of the completion notifications.  There is a
           problem with zero-copy, though: the UDP socket doesn't handle running
           out of option memory very gracefully.
      
      With regard to this first patchset, the changes made include:
      
       (1) Some fixes, including a fallback for proc_create_net_single_write(),
           setting ack.bufferSize to 0 in ACK packets and a fix for rxrpc
           congestion management, which shouldn't be saving the cwnd value
           between calls.
      
       (2) Improvements in rxrpc tracepoints, including splitting the timer
           tracepoint into a set-timer and a timer-expired trace.
      
       (3) Addition of a new proc file to display some stats.
      
       (4) Some code cleanups, including removing some unused bits and
           unnecessary header inclusions.
      
       (5) A change to the recently added UDP encap_err_rcv hook so that it has
           the same signature as {ip,ipv6}_icmp_error(), and then just have rxrpc
           point its UDP socket's hook directly at those.
      
       (6) Definition of a new struct, rxrpc_txbuf, that is used to hold
           transmissible packets of DATA and ACK type in a single 2KiB block
           rather than using an sk_buff.  This allows the buffer to be on a
           number of queues simultaneously more easily, and also guarantees that
           the entire block is in a single unit for zerocopy purposes and that
           the data payload is aligned for in-place crypto purposes.
      
       (7) ACK txbufs are allocated at proposal and queued for later transmission
           rather than being stored in a single place in the rxrpc_call struct,
           which meant that only a single ACK could be pending transmission at a time.
           The queue is then drained at various points.  This allows the ACK
           generation code to be simplified.
      
       (8) The Rx ring buffer is removed.  When a jumbo packet is received (which
           comprises a number of ordinary DATA packets glued together), it used
           to be pointed to by the ring multiple times, with an annotation in a
           side ring indicating which subpacket was in that slot - but this is no
           longer possible.  Instead, the packet is cloned once for each
           subpacket, barring the last, and the range of data is set in the skb
           private area.  This makes it easier for the subpackets in a jumbo
           packet to be decrypted in parallel.
      
       (9) The Tx ring buffer is removed.  The side annotation ring that held the
           SACK information is also removed.  Instead, in the event of packet
           loss, the SACK data attached to an ACK packet is parsed.
      
      (10) Allocate an skcipher request when needed in the rxkad security class
           rather than caching one in the rxrpc_call struct.  This deals with a
           race between externally-driven call disconnection getting rid of the
           skcipher request and sendmsg/recvmsg trying to use it because they
           haven't seen the completion yet.  This is also needed to support
           parallelisation as the skcipher request cannot be used by two or more
           threads simultaneously.
      
      (11) Call udp_sendmsg() and udpv6_sendmsg() directly rather than going
           through kernel_sendmsg() so that we can provide our own iterator
           (zerocopy explicitly doesn't work with a KVEC iterator).  This also
           lets us avoid the overhead of the security hook.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
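Aim (7) in the cover letter above, consulting a list of timeouts instead of arming system timers, could look roughly like the following. This is a sketch under assumptions: `next_sleep()` and `NO_TIMEOUT` are hypothetical names, deadlines are plain 64-bit timestamps in arbitrary units, and the patchset itself does not implement this step yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel meaning "no deadline armed"; also the result when nothing is due. */
#define NO_TIMEOUT UINT64_MAX

/*
 * Work out how long an idle I/O thread may sleep: the gap to the earliest
 * deadline, 0 if a deadline has already passed, or NO_TIMEOUT if no
 * deadline is armed at all.
 */
static uint64_t next_sleep(const uint64_t *deadlines, size_t n, uint64_t now)
{
	uint64_t earliest = NO_TIMEOUT;

	for (size_t i = 0; i < n; i++)
		if (deadlines[i] < earliest)
			earliest = deadlines[i];

	if (earliest == NO_TIMEOUT)
		return NO_TIMEOUT;
	return earliest > now ? earliest - now : 0;
}
```

The thread would only need to schedule with this value when it has nothing else to do, so no per-timer wakeups are required.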
    • net/core: Allow live renaming when an interface is up · bd039b5e
      Andy Ren authored
      Allow a network interface to be renamed when the interface
      is up.
      
      As described in the netconsole documentation [1], when netconsole is
      used as a built-in, it will bring up the specified interface as soon as
      possible. As a result, user space will not be able to rename the
      interface since the kernel disallows renaming of interfaces that are
      administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set
      by the kernel.
      
      The original solution [2] to this problem was to add a new parameter to
      the netconsole configuration parameters that allows renaming of
      the interface used by netconsole while it is administratively up.
      However, during the discussion that followed, it became apparent that we
      have no reason to keep the current restriction and instead we should
      allow user space to rename interfaces regardless of their administrative
      state:
      
      1. The restriction was put in place over 20 years ago when renaming was
      only possible via IOCTL and before rtnetlink started notifying user
      space about such changes like it does today.
      
      2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version
      5.2 and no regressions were reported.
      
      3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about
      the administrative state of the interface.
      
      Therefore, allow user space to rename running interfaces by removing the
      restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in
      possible triage by emitting a message to the kernel log that an
      interface was renamed while UP.
      
      [1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst
      [2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/
      Signed-off-by: Andy Ren <andy.ren@getcruise.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dsa-microchip-checking' · b96c7b4c
      David S. Miller authored
      Rakesh Sankaranarayanan says:
      
      ====================
      net: dsa: microchip: ksz_pwrite status check for lan937x and irq and error checking updates for ksz series
      
      This patch series includes the following changes:
      - Add KSZ9563 to ksz_switch_chips. In the current structure, the
        KSZ9893 entry is reused inside ksz_switch_chips, but since the
        number of irqs differs, a new member is added for KSZ9563 and the
        sku is detected from the Global Chip ID 4 Register. The compatible
        string from the device tree is mapped to KSZ9563 for spi and i2c
        mode probes.
      - Assign the device interrupt during the i2c probe operation.
      - Add error checking for ksz_pwrite inside lan937x_change_mtu. After v6.0,
        ksz_pwrite was updated to return int instead of void, but
        lan937x_change_mtu still uses ksz_pwrite without checking the status.
      - Add port_nirq as 3 for the KSZ8563 switch family.
      - Use dev_err_probe() instead of dev_err() for more standardized error
        formatting and logging.
      
      v1 -> v2:
      - Removed regmap validation patch from the series, planning to take
        up in future after checking for any better approach and studying
        the actual need for this change.
      - Resolved error reported in ksz8863_smi.c file.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: microchip: add dev_err_probe in probe functions · 9b183317
      Rakesh Sankaranarayanan authored
      The probe functions use plain dev_err() to report error conditions
      and print messages. Replace dev_err() with dev_err_probe() for
      more standardized formatting and error logging.
      Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: microchip: ksz8563: Add number of port irq · 4630d142
      Rakesh Sankaranarayanan authored
      KSZ8563 has three port interrupts: PTP, PHY and ACL. Add
      port_nirq as 3 for KSZ8563 inside ksz_chip_data.
      Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: microchip: add error checking for ksz_pwrite · e06999c3
      Rakesh Sankaranarayanan authored
      Add status validation for the port register write inside
      lan937x_change_mtu. The ksz_pwrite and ksz_pread APIs were
      updated to return int (see the reference patch linked
      below). Update lan937x_change_mtu to validate the status
      returned by ksz_pwrite16().
      
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220826105634.3855578-6-o.rempel@pengutronix.de/
      Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
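The pattern this commit applies, checking the status of a register write instead of ignoring it, can be sketched as below. Everything here is hypothetical: `fake_port_write16()` and `fake_change_mtu()` are user-space stand-ins and do not reproduce the driver's ksz_pwrite16() or lan937x_change_mtu().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical port register write that fails for an invalid port. */
static int fake_port_write16(int port, uint16_t value)
{
	(void)value;
	return port < 0 ? -EIO : 0;
}

/* Propagate the write status to the caller rather than dropping it. */
static int fake_change_mtu(int port, int new_mtu)
{
	int ret;

	if (new_mtu < 0 || new_mtu > UINT16_MAX)
		return -EINVAL;

	ret = fake_port_write16(port, (uint16_t)new_mtu);
	if (ret)
		return ret;	/* the caller learns that the write failed */

	return 0;
}
```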
    • net: dsa: microchip: add irq in i2c probe · a9c6db3b
      Rakesh Sankaranarayanan authored
      Add the device irq in the i2c probe function.
      Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: microchip: add ksz9563 in ksz_switch_ops and select based on compatible string · ef912fe4
      Rakesh Sankaranarayanan authored
      Add KSZ9563 to the ksz_switch_chips structure with
      port_nirq as 3. KSZ9563 uses the KSZ9893 switch parameters,
      but the port_nirq count is 3 for KSZ9563 whereas it is 2 for
      KSZ9893. Add KSZ9563 to ksz_switch_chips as a separate
      member and map the device tree compatible string to
      KSZ9563 in ksz_spi.c and ksz9477_i2c.c.
      The Global Chip ID 1 and 2 registers read 9893, so select
      the sku based on the Global Chip ID 4 Register, which reads
      0x1c for KSZ9563.
      Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
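The sku selection described in this commit can be sketched as a small lookup. The names `detect_chip()`, `port_nirq()` and the `CHIP_*` constants are hypothetical, and reading the "9893" in the commit message as the hex value 0x9893 is an assumption; only the 0x1c value and the irq counts of 3 and 2 come from the message.

```c
#include <assert.h>
#include <stdint.h>

enum chip { CHIP_UNKNOWN, CHIP_KSZ9893, CHIP_KSZ9563 };

#define ID12_KSZ9893	0x9893	/* assumption: Chip ID 1 and 2 combined, as hex */
#define ID4_KSZ9563	0x1c	/* from the commit message */

/* Both skus report the same Chip ID 1/2, so Chip ID 4 breaks the tie. */
static enum chip detect_chip(uint16_t id12, uint8_t id4)
{
	if (id12 != ID12_KSZ9893)
		return CHIP_UNKNOWN;
	return id4 == ID4_KSZ9563 ? CHIP_KSZ9563 : CHIP_KSZ9893;
}

/* Number of port interrupts per chip, as stated in the commit message. */
static int port_nirq(enum chip c)
{
	switch (c) {
	case CHIP_KSZ9563:
		return 3;
	case CHIP_KSZ9893:
		return 2;
	default:
		return 0;
	}
}
```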
    • net: ethernet: renesas: rswitch: Fix endless loop in error paths · 380f9acd
      Yoshihiro Shimoda authored
      Coverity reported that the error path in rswitch_gwca_queue_alloc_skb()
      can loop endlessly. Fix the issue by changing the variables' types
      from u32 to int. With the types changed, rswitch_tx_free() should
      use rswitch_get_num_cur_queues() to calculate the number of
      current queues.
      Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
      Addresses-Coverity-ID: 1527147 ("Control flow issues")
      Fixes: 3590918b ("net: ethernet: renesas: Add support for "Ethernet Switch"")
      Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
      Link: https://lore.kernel.org/r/20221107081021.2955122-1-yoshihiro.shimoda.uh@renesas.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
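Why switching an index from u32 to int can cure an endless loop is shown by the generic sketch below. It does not reproduce the driver's loop, which is not quoted in the commit message; `unwind_signed()` and `unsigned_wrap()` are hypothetical helpers that only demonstrate the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Unwind loop with a signed index: it walks back from the entry before the
 * failure down to 0 and stops once the index goes negative.
 */
static int unwind_signed(int failed_at)
{
	int freed = 0;

	for (int i = failed_at - 1; i >= 0; i--)
		freed++;
	return freed;
}

/*
 * With an unsigned index the condition "i >= 0" is always true, because
 * decrementing 0 wraps to the maximum value instead of going negative.
 */
static uint32_t unsigned_wrap(void)
{
	uint32_t i = 0;

	i--;
	return i;
}
```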
  2. 08 Nov, 2022 29 commits
    • lib: Fix some kernel-doc comments · 8e18be76
      Yang Li authored
      Change the description of @policy to @p in the nla_policy_len()
      kernel-doc comment to clear the warnings below:
      
      lib/nlattr.c:660: warning: Function parameter or member 'p' not described in 'nla_policy_len'
      lib/nlattr.c:660: warning: Excess function parameter 'policy' description in 'nla_policy_len'
      
      Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2736
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20221107062623.6709-1-yang.lee@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • rxrpc: Allocate an skcipher each time needed rather than reusing · 30d95efe
      David Howells authored
      In the rxkad security class, allocate the skcipher used to do packet
      encryption and decryption each time it is needed rather than allocating
      one up front and reusing it for each packet.  Reusing the skcipher
      precludes doing crypto in parallel.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Fix congestion management · 1fc4fa2a
      David Howells authored
      rxrpc has a problem in its congestion management in that it saves the
      congestion window size (cwnd) from one call to another, but if this is 0 at
      the time it is saved, then the next call may never manage to
      transmit anything.
      
      To this end:
      
       (1) Don't save cwnd between calls, but rather reset back down to the
           initial cwnd and re-enter slow-start if data transmission is idle for
           more than an RTT.
      
       (2) Preserve ssthresh instead, as that is a handy estimate of pipe
           capacity.  Knowing roughly when to stop slow start and enter
           congestion avoidance can reduce the tendency to overshoot and drop
           larger amounts of packets when probing.
      
      In future, cwnd growth also needs to be constrained when the window isn't
      being filled due to being application limited.
      Reported-by: Simon Wilkinson <sxw@auristor.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
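The two rules in this commit, resetting cwnd after an idle period longer than an RTT while preserving ssthresh, can be sketched as follows. The names `struct cong_state`, `cong_on_tx_resume()`, `in_slow_start()` and the value of `INITIAL_CWND` are assumptions for illustration, as is the "cwnd below ssthresh" test for slow start; the real rxrpc code is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INITIAL_CWND 4	/* hypothetical initial window, in packets */

struct cong_state {
	unsigned int cwnd;	/* congestion window, in packets */
	unsigned int ssthresh;	/* slow-start threshold, kept between calls */
};

/*
 * Called when transmission resumes: if the sender has been idle for more
 * than one RTT, drop back to the initial window and leave ssthresh alone.
 */
static void cong_on_tx_resume(struct cong_state *c, uint64_t idle_us,
			      uint64_t rtt_us)
{
	if (idle_us > rtt_us)
		c->cwnd = INITIAL_CWND;
}

/* Slow start applies while the window is below the preserved threshold. */
static bool in_slow_start(const struct cong_state *c)
{
	return c->cwnd < c->ssthresh;
}
```

Because ssthresh survives the reset, the sender can ramp up quickly to roughly the previously observed pipe capacity and then switch to congestion avoidance.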
    • rxrpc: Remove the rxtx ring · 6869ddb8
      David Howells authored
      The Rx/Tx ring is no longer used, so remove it.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Save last ACK's SACK table rather than marking txbufs · d57a3a15
      David Howells authored
      Improve the tracking of which packets need to be transmitted by saving the
      last ACK packet that we receive that has a populated soft-ACK table rather
      than marking packets.  Then we can step through the soft-ACK table and look
      at the packets we've transmitted beyond that to determine which packets we
      might want to retransmit.
      
      We also look at the highest serial number that has been acked to try and
      guess which packets we've transmitted the peer is likely to have seen.  If
      necessary, we send a ping to retrieve that number.
      
      One downside that might be a problem is that we can't then compare the
      previous acked/unacked state so easily in rxrpc_input_soft_acks() - which
      is a potential problem for the slow-start algorithm.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
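Stepping through a saved soft-ACK table to pick retransmission candidates might look like the sketch below. It assumes one byte per packet with 1 meaning acknowledged and 0 meaning not acknowledged; `collect_nacked()` and the `ACK_TYPE_*` names are hypothetical and the serial-number heuristic from the commit message is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACK_TYPE_NACK	0	/* assumption: packet not yet received by the peer */
#define ACK_TYPE_ACK	1	/* assumption: packet soft-acked by the peer */

/*
 * Walk the soft-ACK table from the last ACK packet. Entry i describes
 * sequence number first_seq + i. Collect up to max sequence numbers that
 * were not acked and return how many were found.
 */
static size_t collect_nacked(uint32_t first_seq, const uint8_t *acks,
			     size_t nacks, uint32_t *out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < nacks && n < max; i++)
		if (acks[i] != ACK_TYPE_ACK)
			out[n++] = first_seq + (uint32_t)i;
	return n;
}
```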
    • rxrpc: Remove call->lock · 4e76bd40
      David Howells authored
      call->lock is no longer necessary, so remove it.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Don't use a ring buffer for call Tx queue · a4ea4c47
      David Howells authored
      Change the way the Tx queueing works to make the following ends easier to
      achieve:
      
       (1) The filling of packets, the encryption of packets and the transmission
           of packets can be handled in parallel by separate threads, rather than
           rxrpc_sendmsg() allocating, filling, encrypting and transmitting each
           packet before moving onto the next one.
      
       (2) Get rid of the fixed-size ring which sets a hard limit on the number
           of packets that can be retained in the ring.  This allows the number
           of packets to increase without having to allocate a very large ring or
           having variable-sized rings.
      
           [Note: the downside of this is that it's then less efficient to locate
           a packet for retransmission as we then have to step through a list and
           examine each buffer in the list.]
      
       (3) Allow the filler/encrypter to run ahead of the transmission window.
      
       (4) Make it easier to do zero copy UDP from the packet buffers.
      
       (5) Make it easier to do zero copy from userspace to the packet buffers -
           and thence to UDP (only for unauthenticated connections).
      
      To that end, the following changes are made:
      
       (1) Use the new rxrpc_txbuf struct instead of sk_buff for keeping packets
           to be transmitted in.  This allows them to be placed on multiple
           queues simultaneously.  An sk_buff isn't really necessary as it's
           never passed on to lower-level networking code.
      
       (2) Keep the transmissible packets in a linked list on the call struct
           rather than in a ring.  As a consequence, the annotation buffer isn't
           used either; rather a flag is set on the packet to indicate ackedness.
      
       (3) Use the RXRPC_CALL_TX_LAST flag to indicate that the last packet to be
           transmitted has been queued.  Add RXRPC_CALL_TX_ALL_ACKED to indicate
           that all packets up to and including the last got hard acked.
      
       (4) Wire headers are now stored in the txbuf rather than being concocted
           on the stack and they're stored immediately before the data, thereby
           allowing zerocopy of a single span.
      
       (5) Don't bother with instant-resend on transmission failure; rather,
           leave it for a timer or an ACK packet to trigger.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
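The linked-list Tx queue with a per-packet acked flag, and the linear search it implies for retransmission, can be sketched as below. `struct txb`, `first_unacked()` and `find_seq()` are hypothetical simplifications; the kernel uses its own list primitives and the real rxrpc_txbuf layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transmit buffer on a per-call singly linked list. */
struct txb {
	uint32_t seq;		/* sequence number of the DATA packet */
	bool acked;		/* flag set on the packet to indicate ackedness */
	struct txb *next;
};

/* Find the oldest packet that has not been acked yet, if any. */
static const struct txb *first_unacked(const struct txb *head)
{
	for (; head; head = head->next)
		if (!head->acked)
			return head;
	return NULL;
}

/*
 * Locate a packet by sequence number. Unlike a ring, this has to step
 * through the list, which is the downside noted in the commit message.
 */
static const struct txb *find_seq(const struct txb *head, uint32_t seq)
{
	for (; head; head = head->next)
		if (head->seq == seq)
			return head;
	return NULL;
}
```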
    • rxrpc: Get rid of the Rx ring · 5d7edbc9
      David Howells authored
      Get rid of the Rx ring and replace it with a pair of queues instead.  One
      queue gets the packets that are in-sequence and are ready for processing by
      recvmsg(); the other queue gets the out-of-sequence packets for addition to
      the first queue as the holes get filled.
      
      The annotation ring is removed and replaced with a SACK table.  The SACK
      table has the bits set that correspond exactly to the sequence number of
      the packet being acked.  The SACK ring is copied when an ACK packet is
      being assembled and rotated so that the first ACK is in byte 0.
      
      Flow control handling is altered so that packets that are moved to the
      in-sequence queue are hard-ACK'd even before they're consumed - and then
      the Rx window size in the ACK packet (rsize) is shrunk down to compensate
      (even going to 0 if the window is full).
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
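The copy-and-rotate step for the SACK table can be sketched as follows. This assumes a table indexed by sequence number modulo its size with one byte per entry; `SACK_SIZE`, `sack_mark()` and `sack_rotate()` are hypothetical, and the table size is kept tiny for readability.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SACK_SIZE 8	/* hypothetical; far smaller than a real table */

/* Record receipt of a packet in the slot matching its sequence number. */
static void sack_mark(uint8_t *table, uint32_t seq)
{
	table[seq % SACK_SIZE] = 1;
}

/*
 * Copy the circular table into an ACK body, rotated so that the entry for
 * window_base (the first sequence number being reported) lands in byte 0.
 */
static void sack_rotate(const uint8_t *table, uint32_t window_base,
			uint8_t *out)
{
	unsigned int first = window_base % SACK_SIZE;
	unsigned int tail = SACK_SIZE - first;

	memcpy(out, table + first, tail);	/* window_base .. end of table */
	memcpy(out + tail, table, first);	/* wrapped-around entries */
}
```

Rotating only when the ACK is assembled keeps marking an O(1) operation on the receive path.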
    • rxrpc: Clone received jumbo subpackets and queue separately · d4d02d8b
      David Howells authored
      Split up received jumbo packets into separate skbuffs by cloning the
      original skbuff for each subpacket and setting the offset and length of the
      data in that subpacket in the skbuff's private data.  The subpackets are
      then placed on the recvmsg queue separately.  The security class then gets
      to revise the offset and length to remove its metadata.
      
      If we fail to clone a packet, we just drop it and let the peer resend it.
      The original packet gets used for the final subpacket.
      
      This should make it easier to handle parallel decryption of the subpackets.
      It also simplifies the handling of lost or misordered packets in the
      queuing/buffering loop as the possibility of overlapping jumbo packets no
      longer needs to be considered.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
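Recording an offset and length per subpacket, as this commit does in the skbuff's private data, can be sketched as below. The layout constants are assumptions (1412 data bytes per non-final subpacket, each followed by a 4-byte jumbo header), `struct subpkt` and `split_jumbo()` are hypothetical, and the sketch decides where to stop purely from the remaining length, whereas the real code presumably consults per-subpacket header flags.

```c
#include <assert.h>
#include <stddef.h>

#define JUMBO_DATALEN	1412	/* assumed data bytes in a non-final subpacket */
#define JUMBO_HDRLEN	4	/* assumed size of the header between subpackets */

/* Range of one subpacket's data within the jumbo packet's payload. */
struct subpkt {
	size_t offset;
	size_t len;
};

/*
 * Split a jumbo payload into per-subpacket ranges. Every subpacket but the
 * last is JUMBO_DATALEN bytes and is followed by a header; the last takes
 * whatever remains. Returns the number of ranges written to out.
 */
static size_t split_jumbo(size_t payload_len, struct subpkt *out, size_t max)
{
	size_t off = 0, n = 0;

	while (n < max && payload_len - off > JUMBO_DATALEN + JUMBO_HDRLEN) {
		out[n].offset = off;
		out[n].len = JUMBO_DATALEN;
		n++;
		off += JUMBO_DATALEN + JUMBO_HDRLEN;
	}
	if (n < max) {
		out[n].offset = off;
		out[n].len = payload_len - off;
		n++;
	}
	return n;
}
```

Each range could then be attached to its own clone of the original buffer, so that the subpackets can be queued and decrypted independently.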
    • rxrpc: Split the rxrpc_recvmsg tracepoint · faf92e8d
      David Howells authored
      Split the rxrpc_recvmsg tracepoint so that the tracepoints that are about
      data packet processing (and which have extra pieces of information) are
      separate from the tracepoint that shows the general flow of recvmsg().
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Clean up ACK handling · 530403d9
      David Howells authored
      Clean up the rxrpc_propose_ACK() function.  If deferred PING ACK proposal
      is split out, it's only really needed for deferred DELAY ACKs.  All other
      ACKs, bar the terminal IDLE ACK, are sent immediately.  The deferred IDLE ACK
      submission can be handled by conversion of a DELAY ACK into an IDLE ACK if
      there's nothing to be SACK'd.
      
      Also, because there's a delay between an ACK being generated and being
      transmitted, it's possible that other ACKs of the same type will be
      generated during that interval.  Apart from the ACK time and the serial
      number responded to, most of the ACK body, including window and SACK
      parameters, are not filled out till the point of transmission - so we can
      avoid generating a new ACK if there's one pending that will cover the SACK
      data we need to convey.
      
      Therefore, don't propose a new DELAY or IDLE ACK for a call if there's one
      already pending.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
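The rule "don't propose a new DELAY or IDLE ACK if one is already pending" can be sketched with a small per-call state. `struct call_acks`, `propose_ack()`, `ack_transmitted()` and the enum are hypothetical names; treating PING ACKs as always queued is a simplification for the sketch, not a statement about the real code.

```c
#include <assert.h>
#include <stdbool.h>

enum ack_reason { ACK_DELAY, ACK_IDLE, ACK_PING, ACK_REASON_COUNT };

/* Hypothetical per-call bookkeeping of proposed but untransmitted ACKs. */
struct call_acks {
	bool pending[ACK_REASON_COUNT];
	unsigned int queued;	/* total ACKs queued for transmission */
};

/*
 * Propose an ACK. A DELAY or IDLE ACK is skipped when one of the same type
 * is already pending, since the pending one is filled in at transmission
 * time and will carry the up-to-date window and SACK data.
 */
static bool propose_ack(struct call_acks *c, enum ack_reason why)
{
	if ((why == ACK_DELAY || why == ACK_IDLE) && c->pending[why])
		return false;

	c->pending[why] = true;
	c->queued++;
	return true;
}

/* Clear the pending mark once the ACK has gone out. */
static void ack_transmitted(struct call_acks *c, enum ack_reason why)
{
	c->pending[why] = false;
}
```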
    • rxrpc: Allocate ACK records at proposal and queue for transmission · 72f0c6fb
      David Howells authored
      Allocate rxrpc_txbuf records for ACKs and put them onto a queue for the
      transmitter thread to dispatch.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Define rxrpc_txbuf struct to carry data to be transmitted · 02a19356
      David Howells authored
      Define a struct, rxrpc_txbuf, to carry data to be transmitted instead of a
      socket buffer so that it can be placed onto multiple queues at once.  This
      also allows the data buffer to be in the same allocation as the internal
      data.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
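A buffer that can sit on several queues at once and keeps its payload in the same allocation as its bookkeeping might be laid out as in the sketch below. This is not the real rxrpc_txbuf definition: the field names, `struct list_link`, `struct txbuf_meta` and `txbuf_alloc()` are hypothetical, and only the 2KiB total size is taken from the cover letter of this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TXBUF_SIZE 2048	/* a single 2KiB block, as stated in the cover letter */

/* Minimal doubly linked list hook, one per queue the buffer can join. */
struct list_link {
	struct list_link *prev, *next;
};

/* Hypothetical bookkeeping kept in front of the payload. */
struct txbuf_meta {
	struct list_link call_link;	/* membership of a per-call queue */
	struct list_link tx_link;	/* membership of a transmission queue */
	uint32_t seq;			/* sequence number of the packet */
	uint16_t len;			/* bytes of data[] in use */
};

/* Metadata and payload share one allocation of exactly TXBUF_SIZE bytes. */
struct txbuf {
	struct txbuf_meta meta;
	uint8_t data[TXBUF_SIZE - sizeof(struct txbuf_meta)];
};

_Static_assert(sizeof(struct txbuf) == TXBUF_SIZE, "txbuf must be 2KiB");

/* Allocate a zeroed buffer for the given sequence number. */
static struct txbuf *txbuf_alloc(uint32_t seq)
{
	struct txbuf *tb = calloc(1, sizeof(*tb));

	if (tb)
		tb->meta.seq = seq;
	return tb;
}
```

Having two independent link fields is what lets one buffer be on two queues simultaneously, something a single sk_buff list hook does not allow.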
    • rxrpc: Remove call->tx_phase · a11e6ff9
      David Howells authored
      Remove call->tx_phase as it's only ever set.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Remove the flags from the rxrpc_skb tracepoint · 27f699cc
      David Howells authored
      Remove the flags from the rxrpc_skb tracepoint as we're no longer going to
      be using this for the transmission buffers and so marking which are
      transmission buffers isn't going to be necessary.
      
      Note that this also removes the rxrpc skb flag that indicates if this is a
      transmission buffer and so the count is not updated for the moment.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Remove unnecessary header inclusions · 23b237f3
      David Howells authored
      Remove a bunch of unnecessary header inclusions.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Call udp_sendmsg() directly · ed472b0c
      David Howells authored
      Call udp_sendmsg() and udpv6_sendmsg() directly rather than calling
      kernel_sendmsg() as the latter assumes we want a kvec-class iterator.
      However, zerocopy explicitly doesn't work with such an iterator.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Use the core ICMP/ICMP6 parsers · b6c66c43
      David Howells authored
      Make rxrpc_encap_rcv_err() pass the ICMP/ICMP6 skbuff to ip_icmp_error() or
      ipv6_icmp_error() as appropriate to do the parsing rather than trying to do
      it in rxrpc.
      
      This pushes an error report onto the UDP socket's error queue and calls
      ->sk_error_report() from which point rxrpc can pick it up.
      
      It would be preferable to steal the packet directly from ip*_icmp_error()
      rather than letting it get queued, but this is probably good enough.
      
      Also note that __udp4_lib_err() calls sk_error_report() twice in some
      cases.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      b6c66c43
    • net: Change the udp encap_err_rcv to allow use of {ip,ipv6}_icmp_error() · 42fb06b3
      David Howells authored
      Change the udp encap_err_rcv signature to match ip_icmp_error() and
      ipv6_icmp_error() so that those can be used from the called function and
      export them.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      cc: netdev@vger.kernel.org
      42fb06b3
    • rxrpc: Fix ack.bufferSize to be 0 when generating an ack · 8889a711
      David Howells authored
      ack.bufferSize should be set to 0 when generating an ack.
      
      Fixes: 8d94aa38 ("rxrpc: Calls shouldn't hold socket refs")
      Reported-by: Jeffrey Altman <jaltman@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      8889a711
    • rxrpc: Record stats for why the REQUEST-ACK flag is being set · f7fa5242
      David Howells authored
      Record stats for why the REQUEST-ACK flag is being set.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      f7fa5242
    • rxrpc: Record statistics about ACK types · f2a676d1
      David Howells authored
      Record statistics about the different types of ACKs that have been
      transmitted and received and the number of ACKs that have been filled out
      and transmitted or that have been skipped.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      f2a676d1
    • rxrpc: Add stats procfile and DATA packet stats · b0154246
      David Howells authored
      Add a procfile, /proc/net/rxrpc/stats, to display some statistics about
      what rxrpc has been doing.  Writing a blank line to the stats file will
      clear the increment-only counters.  Allocated resource counters don't get
      cleared.
      
      Add counters for DATA packets, including the number created, transmitted,
      retransmitted and received, the number of ACK-request markings and the
      number of jumbo packets received.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      b0154246
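      The procfile behaviour described in this commit can be exercised from
      userspace roughly as follows. This is a minimal sketch, not code from the
      patch: the helper names rxrpc_stats_dump() and rxrpc_stats_clear() are
      hypothetical, and the only facts taken from the commit message are the
      path /proc/net/rxrpc/stats and that writing a blank line clears the
      increment-only counters.

```c
#include <stdio.h>

/* Hypothetical helper: copy the contents of a stats file to `out`.
 * Returns 0 on success, -1 if the file cannot be opened. */
static int rxrpc_stats_dump(const char *path, FILE *out)
{
    char line[256];
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        fputs(line, out);
    fclose(f);
    return 0;
}

/* Hypothetical helper: write a single blank line to the stats file.
 * Per the commit message, this clears the increment-only counters and
 * leaves the allocated-resource counters alone.
 * Returns 0 on success, -1 on any open/write/close error. */
static int rxrpc_stats_clear(const char *path)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = (fputs("\n", f) != EOF);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

      A caller would pass "/proc/net/rxrpc/stats" as the path, for example
      rxrpc_stats_dump("/proc/net/rxrpc/stats", stdout) followed by
      rxrpc_stats_clear("/proc/net/rxrpc/stats"). Clearing is likely to need
      suitable privileges; the required permissions are not stated in the
      commit message.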
    • rxrpc: Track highest acked serial · 589a0c1e
      David Howells authored
      Keep track of the highest DATA serial number that has been acked by the
      peer for future purposes.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      589a0c1e
    • rxrpc: Split call timer-expiration from call timer-set tracepoint · 334dfbfc
      David Howells authored
      Split the tracepoint for call timer-set to separate out the call
      timer-expiration event.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      334dfbfc
    • rxrpc: Trace setting of the request-ack flag · 4d843be5
      David Howells authored
      Add a tracepoint that logs why the request-ack flag is set on an outgoing
      DATA packet, to aid debugging.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      4d843be5
    • net, proc: Provide PROC_FS=n fallback for proc_create_net_single_write() · c3d96f69
      David Howells authored
      Provide a CONFIG_PROC_FS=n fallback for proc_create_net_single_write().
      
      Also provide a fallback for proc_create_net_data_write().
      
      Fixes: 564def71 ("proc: Add a way to make network proc files writable")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      cc: netdev@vger.kernel.org
      c3d96f69
    • Merge branch 'bnxt_en-updates' · ee1bfbcc
      Paolo Abeni authored
      Michael Chan says:
      
      ====================
      bnxt_en: Updates
      
      This small patchset adds an improvement to the configuration of ethtool
      RSS tuple hash and a PTP improvement when running in a multi-host
      environment.
      ====================
      
      Link: https://lore.kernel.org/r/1667780192-3700-1-git-send-email-michael.chan@broadcom.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      ee1bfbcc
    • bnxt_en: Add a non-real time mode to access NIC clock · 85036aee
      Pavan Chebbi authored
      When using a PHC that is shared between multiple hosts,
      in order to achieve consistent timestamps across all hosts,
      we need to isolate the PHC from any host making frequency
      adjustments.
      
      This patch adds a non-real time mode for this purpose.
      The implementation is based on a free running NIC hardware timer
      which is used as the timestamper time-base. Each host implements
      individual adjustments to a local timecounter based on the NIC free
      running timer.
      
      Cc: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      85036aee
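      The idea of per-host adjustments layered on a shared free-running timer
      can be illustrated with a small userspace sketch. This is not the bnxt_en
      implementation: struct host_clock and its functions are hypothetical
      names, and the mult/shift conversion is only modelled loosely on the
      kernel's cyclecounter/timecounter scheme. Fractional nanoseconds are
      dropped and overflow of delta * mult is ignored, which a real driver
      would have to handle.

```c
#include <stdint.h>

/* Hypothetical per-host software clock driven by a shared, free-running
 * hardware counter that no host is allowed to adjust. */
struct host_clock {
    uint64_t last_cycles; /* raw counter value at the last update */
    uint64_t ns;          /* host-local time in ns at the last update */
    uint32_t mult;        /* ns per cycle, scaled by 2^shift */
    uint32_t shift;
};

static void host_clock_init(struct host_clock *hc, uint64_t cycles,
                            uint64_t start_ns, uint32_t mult, uint32_t shift)
{
    hc->last_cycles = cycles;
    hc->ns = start_ns;
    hc->mult = mult;
    hc->shift = shift;
}

/* Advance the local clock to a raw counter reading and return local ns. */
static uint64_t host_clock_read(struct host_clock *hc, uint64_t cycles)
{
    uint64_t delta = cycles - hc->last_cycles;

    hc->ns += (delta * hc->mult) >> hc->shift;
    hc->last_cycles = cycles;
    return hc->ns;
}

/* Phase adjustment: shifts only this host's view of time. */
static void host_clock_adjtime(struct host_clock *hc, int64_t delta_ns)
{
    hc->ns += (uint64_t)delta_ns;
}

/* Frequency adjustment in parts per billion: changes only this host's
 * multiplier, so the shared counter and the other hosts are unaffected. */
static void host_clock_adjfine(struct host_clock *hc, uint64_t now_cycles,
                               uint32_t base_mult, int32_t ppb)
{
    int64_t adj;

    host_clock_read(hc, now_cycles); /* account elapsed time at the old rate */
    adj = ((int64_t)base_mult * ppb) / 1000000000LL;
    hc->mult = (uint32_t)((int64_t)base_mult + adj);
}
```

      Two hosts initialised from the same counter value keep independent ns
      and mult fields, so a phase or frequency correction applied by one host
      never changes the timestamps the other host derives from the same raw
      counter readings.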