1. 10 Nov, 2022 19 commits
  2. 09 Nov, 2022 13 commits
    • Merge branch 'net-devlink-move-netdev-notifier-block-to-dest-namespace-during-reload' · bf9b8556
      Jakub Kicinski authored
      Jiri Pirko says:
      
      ====================
      net: devlink: move netdev notifier block to dest namespace during reload
      
      Patch #1 is just a dependency of patch #2, which is the actual fix.
      ====================
      
      Link: https://lore.kernel.org/r/20221108132208.938676-1-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: devlink: move netdev notifier block to dest namespace during reload · 15feb56e
      Jiri Pirko authored
      The notifier block tracking netdev changes in devlink is registered
      per-net during devlink_alloc() and unregistered in devlink_free().
      When devlink moves from one net namespace to another, the notifier
      block needs to move along.
      
      Fix this by adding the forgotten call to move the block.
      Reported-by: Ido Schimmel <idosch@idosch.org>
      Fixes: 02a68a47 ("net: devlink: track netdev with devlink_port assigned")
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: introduce a helper to move notifier block to different namespace · 3e52fba0
      Jiri Pirko authored
      Currently, the net_dev() netdev notifier variant follows the netdev
      with its per-net notifier from namespace to namespace. This is
      implemented by the move_netdevice_notifiers_dev_net() helper.
      
      Devlink needs to re-register its per-net notifier during devlink
      reload. Introduce a new helper called move_netdevice_notifier_net()
      and share the unregister/register code with the existing
      move_netdevice_notifiers_dev_net() helper.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
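
The move described above amounts to "unregister from the source namespace, then register in the destination". The following is a minimal userspace sketch of that idea only. Every type and function name in it (struct net_ns, ns_move_notifier() and so on) is hypothetical and is not the kernel's API; the real helpers operate on kernel notifier chains under the appropriate locks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types are the kernel's. */
struct notifier {
    struct notifier *next;
    int id;
};

struct net_ns {
    struct notifier *chain;   /* per-namespace notifier list */
};

static void ns_register(struct net_ns *ns, struct notifier *nb)
{
    nb->next = ns->chain;
    ns->chain = nb;
}

/* Returns 0 on success, -1 if nb was not registered in ns. */
static int ns_unregister(struct net_ns *ns, struct notifier *nb)
{
    struct notifier **pp;

    for (pp = &ns->chain; *pp; pp = &(*pp)->next) {
        if (*pp == nb) {
            *pp = nb->next;
            nb->next = NULL;
            return 0;
        }
    }
    return -1;
}

/* Move = unregister from the source, then register in the destination. */
static int ns_move_notifier(struct net_ns *src, struct net_ns *dst,
                            struct notifier *nb)
{
    assert(src && dst && nb);
    if (ns_unregister(src, nb))
        return -1;
    ns_register(dst, nb);
    return 0;
}

/* Returns 1 if nb is currently on the chain of ns, 0 otherwise. */
static int ns_has(const struct net_ns *ns, const struct notifier *nb)
{
    const struct notifier *p;

    for (p = ns->chain; p; p = p->next)
        if (p == nb)
            return 1;
    return 0;
}
```

Sharing the unregister/register pair in one function is what lets two callers (a per-device move and a per-namespace move) reuse the same code path.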
    • genetlink: correctly begin the iteration over policies · 154ba79c
      Jakub Kicinski authored
      The return value from genl_op_iter_init() only tells us whether
      there are any policies; to begin the iteration (and therefore
      load the first entry) we need to call genl_op_iter_next().
      Note that it's safe to call genl_op_iter_next() on a family
      with no ops; it will just return false.
      
      Skipping that call may lead to various crashes, a warning in
      netlink_policy_dump_get_policy_idx() when the policy is not found,
      or... no problem at all if the kmalloc'ed memory happens to be
      zeroed.
      
      Fixes: b502b318 ("genetlink: use iterator in the op to policy map dumping")
      Link: https://lore.kernel.org/r/20221108204128.330287-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
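
The init/next split described above can be illustrated with a toy iterator. This is a sketch under stated assumptions: struct op_iter, op_iter_init() and op_iter_next() are hypothetical stand-ins that only mimic the contract in the commit message (init reports whether anything exists, next loads an entry); they are not the genetlink implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical iterator; it only mimics the init/next split. */
struct op_iter {
    const int *ops;
    size_t count;
    size_t pos;      /* index of the next entry to load */
    int cur;         /* valid only after op_iter_next() returned true */
};

/* Reports whether there is anything to iterate; does NOT load an entry. */
static bool op_iter_init(struct op_iter *it, const int *ops, size_t count)
{
    assert(it);
    it->ops = ops;
    it->count = count;
    it->pos = 0;
    it->cur = -1;    /* sentinel: nothing loaded yet */
    return count > 0;
}

/* Loads the next entry into it->cur; returns false when exhausted. */
static bool op_iter_next(struct op_iter *it)
{
    if (it->pos >= it->count)
        return false;
    it->cur = it->ops[it->pos++];
    return true;
}

/* Correct pattern: call next() before the first read of it->cur. */
static int sum_ops(const int *ops, size_t count)
{
    struct op_iter it;
    int sum = 0;

    op_iter_init(&it, ops, count);
    while (op_iter_next(&it))
        sum += it.cur;
    return sum;
}
```

Reading `it.cur` straight after `op_iter_init()` would see only the sentinel here; in the case the commit describes, the equivalent field is whatever the allocator left in memory.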
    • Merge tag 'rxrpc-next-20221108' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 3ca6c3b4
      David S. Miller authored
      rxrpc changes
      
      David Howells says:
      
      ====================
      rxrpc: Increasing SACK size and moving away from softirq, part 1
      
      AF_RXRPC has some issues that need addressing:
      
       (1) The SACK table has a maximum capacity of 255, but for modern networks
           that isn't sufficient.  This is hard to increase in the upstream code
           because of the way the application thread is coupled to the softirq
           and retransmission side through a ring buffer.  Adjustments to the rx
           protocol allow a capacity of up to 8192, and having a ring
           sufficiently large to accommodate that would use an excessive amount
           of memory as this is per-call.
      
       (2) Processing ACKs in softirq mode causes the ACKs to get conflated, with
           only the most recent being considered.  Whilst this has the upside
           that the retransmission algorithm only needs to deal with the most
           recent ACK, it causes DATA transmission for a call to be very bursty
           because DATA packets cannot be transmitted in softirq mode.  Rather
           transmission must be delegated to either the application thread or a
           workqueue, so there tend to be sudden bursts of traffic for any
           particular call due to scheduling delays.
      
       (3) All crypto in a single call is done in series; however, each DATA
           packet is individually encrypted so encryption and decryption of large
           calls could be parallelised if spare CPU resources are available.
      
      This is the first of a number of sets of patches that try and address them.
      The overall aims of these changes include:
      
       (1) To get rid of the TxRx ring and instead pass the packets round in
           queues (eg. sk_buff_head).  On the Tx side, each ACK packet comes with
           a SACK table that can be parsed as-is, so there's no particular need
           to maintain our own; we just have to refer to the ACK.
      
           On the Rx side, we do need to maintain a SACK table with one bit per
           entry - but only if packets go missing - and we don't want to have to
           perform a complex transformation to get the information into an ACK
           packet.
      
       (2) To try and move almost all processing of received packets out of the
           softirq handler and into a high-priority kernel I/O thread.  Only the
           transferral of packets would be left there.  I would still use the
           encap_rcv hook to receive packets as there's a noticeable performance
           drop from letting the UDP socket put the packets into its own queue
           and then getting them out of there.
      
       (3) To make the I/O thread also do all the transmission.  The app thread
           would be responsible for packaging the data into packets and then
           buffering them for the I/O thread to transmit.  This would make it
           easier for the app thread to run ahead of the I/O thread, and would
           mean the I/O thread is less likely to have to wait around for a new
           packet to become available for transmission.
      
       (4) To logically partition the socket/UAPI/KAPI side of things from the
           I/O side of things.  The local endpoint, connection, peer and call
           objects would belong to the I/O side.  The socket side would not then
           touch the private internals of calls and suchlike and would not change
           their states.  It would only look at the send queue, receive queue and
           a way to pass a message to cause an abort.
      
       (5) To remove as much locking, synchronisation, barriering and atomic ops
           as possible from the I/O side.  Exclusion would be achieved by
           limiting modification of state to the I/O thread only.  Locks would
           still need to be used in communication with the UDP socket and the
           AF_RXRPC socket API.
      
       (6) To provide crypto offload kernel threads that, when there's slack in
           the system, can see packets that need crypting and provide
           parallelisation in dealing with them.
      
       (7) To remove the use of system timers.  Since each timer would then send
           a poke to the I/O thread, which would then deal with it when it had
           the opportunity, there seems no point in using system timers if,
           instead, a list of timeouts can be sensibly consulted.  An I/O thread
           only then needs to schedule with a timeout when it is idle.
      
       (8) To use zero-copy sendmsg to send packets.  This would make use of the
           I/O thread being the sole transmitter on the socket to manage the
           dead-reckoning sequencing of the completion notifications.  There is a
           problem with zero-copy, though: the UDP socket doesn't handle running
           out of option memory very gracefully.
      
      With regard to this first patchset, the changes made include:
      
       (1) Some fixes, including a fallback for proc_create_net_single_write(),
           setting ack.bufferSize to 0 in ACK packets and a fix for rxrpc
           congestion management, which shouldn't be saving the cwnd value
           between calls.
      
       (2) Improvements in rxrpc tracepoints, including splitting the timer
           tracepoint into a set-timer and a timer-expired trace.
      
       (3) Addition of a new proc file to display some stats.
      
       (4) Some code cleanups, including removing some unused bits and
           unnecessary header inclusions.
      
       (5) A change to the recently added UDP encap_err_rcv hook so that it has
           the same signature as {ip,ipv6}_icmp_error(), and then just have rxrpc
           point its UDP socket's hook directly at those.
      
       (6) Definition of a new struct, rxrpc_txbuf, that is used to hold
           transmissible packets of DATA and ACK type in a single 2KiB block
           rather than using an sk_buff.  This allows the buffer to be on a
           number of queues simultaneously more easily, and also guarantees that
           the entire block is in a single unit for zerocopy purposes and that
           the data payload is aligned for in-place crypto purposes.
      
       (7) ACK txbufs are allocated at proposal and queued for later transmission
           rather than being stored in a single place in the rxrpc_call struct,
           which means only a single ACK can be pending transmission at a time.
           The queue is then drained at various points.  This allows the ACK
           generation code to be simplified.
      
       (8) The Rx ring buffer is removed.  When a jumbo packet is received (which
           comprises a number of ordinary DATA packets glued together), it used
           to be pointed to by the ring multiple times, with an annotation in a
           side ring indicating which subpacket was in that slot - but this is no
           longer possible.  Instead, the packet is cloned once for each
           subpacket, barring the last, and the range of data is set in the skb
           private area.  This makes it easier for the subpackets in a jumbo
           packet to be decrypted in parallel.
      
       (9) The Tx ring buffer is removed.  The side annotation ring that held the
           SACK information is also removed.  Instead, in the event of packet
           loss, the SACK data attached to an ACK packet is parsed.
      
      (10) Allocate an skcipher request when needed in the rxkad security class
           rather than caching one in the rxrpc_call struct.  This deals with a
           race between externally-driven call disconnection getting rid of the
           skcipher request and sendmsg/recvmsg trying to use it because they
           haven't seen the completion yet.  This is also needed to support
           parallelisation as the skcipher request cannot be used by two or more
           threads simultaneously.
      
      (11) Call udp_sendmsg() and udpv6_sendmsg() directly rather than going
           through kernel_sendmsg() so that we can provide our own iterator
           (zerocopy explicitly doesn't work with a KVEC iterator).  This also
           lets us avoid the overhead of the security hook.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/core: Allow live renaming when an interface is up · bd039b5e
      Andy Ren authored
      Allow a network interface to be renamed when the interface
      is up.
      
      As described in the netconsole documentation [1], when netconsole is
      used as a built-in, it will bring up the specified interface as soon as
      possible. As a result, user space will not be able to rename the
      interface since the kernel disallows renaming of interfaces that are
      administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set
      by the kernel.
      
      The original solution [2] to this problem was to add a new parameter to
      the netconsole configuration parameters that allows renaming of
      the interface used by netconsole while it is administratively up.
      However, during the discussion that followed, it became apparent that we
      have no reason to keep the current restriction and instead we should
      allow user space to rename interfaces regardless of their administrative
      state:
      
      1. The restriction was put in place over 20 years ago when renaming was
      only possible via IOCTL and before rtnetlink started notifying user
      space about such changes like it does today.
      
      2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version
      5.2 and no regressions were reported.
      
      3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about
      the administrative state of the interface.
      
      Therefore, allow user space to rename running interfaces by removing the
      restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in
      possible triage by emitting a message to the kernel log that an
      interface was renamed while UP.
      
      [1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst
      [2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/
      Signed-off-by: Andy Ren <andy.ren@getcruise.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dsa-microchip-checking' · b96c7b4c
      David S. Miller authored
      Rakesh Sankaranarayanan says:
      
      ====================
      net: dsa: microchip: ksz_pwrite status check for lan937x and irq and error checking updates for ksz series
      
      This patch series includes the following changes:
      - Add KSZ9563 inside ksz_switch_chips. In the current structure,
      KSZ9893 is reused inside ksz_switch_chips, but since the number
      of IRQs differs, a new member is added for KSZ9563 and the SKU
      is detected based on the Global Chip ID 4 Register. The compatible
      string from the device tree is mapped to KSZ9563 for SPI and I2C
      mode probes.
      - Assign the device interrupt during the I2C probe operation.
      - Add error checking for ksz_pwrite inside lan937x_change_mtu. After v6.0,
      ksz_pwrite was updated to return int instead of void, but
      lan937x_change_mtu still uses ksz_pwrite without status verification.
      - Add port_nirq as 3 for the KSZ8563 switch family.
      - Use dev_err_probe() instead of dev_err() for more standardized error
      formatting and logging.
      
      v1 -> v2:
      - Removed regmap validation patch from the series, planning to take
        up in future after checking for any better approach and studying
        the actual need for this change.
      - Resolved error reported in ksz8863_smi.c file.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: microchip: add dev_err_probe in probe functions · 9b183317
      Rakesh Sankaranarayanan authored
      Probe functions use plain dev_err() to report error conditions
      and print messages. Replace dev_err() with dev_err_probe() for
      more standardized formatting and error logging.
      Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: microchip: ksz8563: Add number of port irq · 4630d142
      Rakesh Sankaranarayanan authored
      KSZ8563 has three port interrupts: PTP, PHY and ACL. Add
      port_nirq as 3 for KSZ8563 inside ksz_chip_data.
      Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: microchip: add error checking for ksz_pwrite · e06999c3
      Rakesh Sankaranarayanan authored
      Add status validation for the port register write inside
      lan937x_change_mtu. The ksz_pwrite and ksz_pread APIs were
      updated to return int (see the reference patch linked
      below). Update lan937x_change_mtu with status validation
      for ksz_pwrite16().
      
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220826105634.3855578-6-o.rempel@pengutronix.de/
      Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
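
The pattern the commit above describes, checking the status of a register write and propagating it, can be sketched as follows. Everything here is hypothetical: fake_pwrite16() stands in for a write helper that returns int, and change_mtu() is not the driver's lan937x_change_mtu(); the error value and port range are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_EINVAL 22   /* stand-in for the kernel's EINVAL */

/* Hypothetical register writer: fails for ports outside 0..7. */
static int fake_pwrite16(int port, uint16_t *reg, uint16_t val)
{
    if (port < 0 || port > 7)
        return -FAKE_EINVAL;
    *reg = val;
    return 0;
}

/* Propagate the write status to the caller rather than discarding it. */
static int change_mtu(int port, uint16_t *reg, int new_mtu)
{
    int ret;

    assert(reg);
    ret = fake_pwrite16(port, reg, (uint16_t)new_mtu);
    if (ret)
        return ret;   /* the caller learns that the register was not updated */

    return 0;
}
```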
    • net: dsa: microchip: add irq in i2c probe · a9c6db3b
      Rakesh Sankaranarayanan authored
      Add the device IRQ in the I2C probe function.
      Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: microchip: add ksz9563 in ksz_switch_ops and select based on compatible string · ef912fe4
      Rakesh Sankaranarayanan authored
      Add KSZ9563 inside the ksz_switch_chips structure with
      port_nirq as 3. KSZ9563 uses the KSZ9893 switch parameters,
      but the port_nirq count is 3 for KSZ9563 whereas it is 2 for
      KSZ9893. Add KSZ9563 inside ksz_switch_chips as a separate
      member and map the device tree compatible string to
      KSZ9563 inside ksz_spi.c and ksz9477_i2c.c.
      The Global Chip ID 1 and 2 registers read 9893, so select
      the SKU based on the Global Chip ID 4 Register, which reads 0x1c
      for KSZ9563.
      Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: renesas: rswitch: Fix endless loop in error paths · 380f9acd
      Yoshihiro Shimoda authored
      Coverity reported that the error path in rswitch_gwca_queue_alloc_skb()
      can cause an endless loop. Fix the issue by changing the
      variables' types from u32 to int. After changing the types,
      rswitch_tx_free() should use rswitch_get_num_cur_queues() to
      calculate the number of current queues.
      Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
      Addresses-Coverity-ID: 1527147 ("Control flow issues")
      Fixes: 3590918b ("net: ethernet: renesas: Add support for "Ethernet Switch"")
      Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
      Link: https://lore.kernel.org/r/20221107081021.2955122-1-yoshihiro.shimoda.uh@renesas.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
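
The commit above does not quote the offending loop, so the following is only a generic sketch of why a u32 index can make a reverse unwind loop endless: an unsigned counter never goes below zero, it wraps. The function names are hypothetical and the loop body is a stand-in for freeing a slot.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unwind helper: undoes slots [0, failed_at) in reverse order. */
static int unwind_count(int failed_at)
{
    int freed = 0;
    int i;              /* signed: i >= 0 becomes false once i reaches -1 */

    assert(failed_at >= 0);
    for (i = failed_at - 1; i >= 0; i--)
        freed++;        /* stand-in for freeing slot i */
    return freed;
}

/* With an unsigned index, the same decrement wraps instead of going negative. */
static uint32_t wrapped_after_zero(void)
{
    uint32_t i = 0;

    i--;                /* well-defined wraparound to UINT32_MAX */
    return i;
}
```

Had `i` in `unwind_count()` been a `uint32_t`, the condition `i >= 0` would always hold and the loop would never exit.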
  3. 08 Nov, 2022 8 commits
    • lib: Fix some kernel-doc comments · 8e18be76
      Yang Li authored
      Change the description of @policy to @p in nla_policy_len()
      to clear the warnings below:
      
      lib/nlattr.c:660: warning: Function parameter or member 'p' not described in 'nla_policy_len'
      lib/nlattr.c:660: warning: Excess function parameter 'policy' description in 'nla_policy_len'
      
      Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2736
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20221107062623.6709-1-yang.lee@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • rxrpc: Allocate an skcipher each time needed rather than reusing · 30d95efe
      David Howells authored
      In the rxkad security class, allocate the skcipher used to do packet
      encryption and decryption rather than allocating one up front and reusing
      it for each packet.  Reusing the skcipher precludes doing crypto in
      parallel.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Fix congestion management · 1fc4fa2a
      David Howells authored
      rxrpc has a problem in its congestion management in that it saves the
      congestion window size (cwnd) from one call to another, but if this is 0 at
      the time it is saved, then the next call may never actually manage to
      transmit anything.
      
      To this end:
      
       (1) Don't save cwnd between calls, but rather reset back down to the
           initial cwnd and re-enter slow-start if data transmission is idle for
           more than an RTT.
      
       (2) Preserve ssthresh instead, as that is a handy estimate of pipe
           capacity.  Knowing roughly when to stop slow start and enter
           congestion avoidance can reduce the tendency to overshoot and drop
           larger amounts of packets when probing.
      
      In future, cwnd growth also needs to be constrained when the window isn't
      being filled due to being application limited.
      Reported-by: Simon Wilkinson <sxw@auristor.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
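
The two rules in the commit above (reset cwnd and re-enter slow start after an idle period longer than an RTT, but keep ssthresh) can be sketched as below. This is an illustrative model, not the rxrpc code: struct cong_state, its field names, the INITIAL_CWND value of 4 and the one-packet-per-ACK growth rule are all assumptions, and congestion-avoidance growth is deliberately left out.

```c
#include <assert.h>

/* Hypothetical per-call state; field names are illustrative only. */
struct cong_state {
    unsigned int cwnd;       /* congestion window, in packets */
    unsigned int ssthresh;   /* slow-start threshold, in packets */
    int slow_start;          /* 1 while in slow start */
};

#define INITIAL_CWND 4u      /* assumed initial window */

/* Call before transmitting: restart slow start after idling for > 1 RTT. */
static void cong_check_idle(struct cong_state *c, unsigned long idle_us,
                            unsigned long rtt_us)
{
    assert(c);
    if (idle_us > rtt_us) {
        c->cwnd = INITIAL_CWND;   /* do not carry the old cwnd over */
        c->slow_start = 1;        /* ssthresh is deliberately preserved */
    }
}

/* On an ACK in slow start: grow by one packet until ssthresh is reached. */
static void cong_on_ack(struct cong_state *c)
{
    if (!c->slow_start)
        return;                   /* congestion avoidance not modelled */
    c->cwnd++;
    if (c->cwnd >= c->ssthresh)
        c->slow_start = 0;
}
```

Because ssthresh survives the reset, the sender knows roughly where to stop doubling, which is the overshoot-reduction argument made in point (2).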
    • rxrpc: Remove the rxtx ring · 6869ddb8
      David Howells authored
      The Rx/Tx ring is no longer used, so remove it.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Save last ACK's SACK table rather than marking txbufs · d57a3a15
      David Howells authored
      Improve the tracking of which packets need to be transmitted by saving the
      last ACK packet that we receive that has a populated soft-ACK table rather
      than marking packets.  Then we can step through the soft-ACK table and look
      at the packets we've transmitted beyond that to determine which packets we
      might want to retransmit.
      
      We also look at the highest serial number that has been acked to try and
      guess which of the packets we've transmitted the peer is likely to have
      seen.  If necessary, we send a ping to retrieve that number.
      
      One downside that might be a problem is that we can't then compare the
      previous acked/unacked state so easily in rxrpc_input_soft_acks() - which
      is a potential problem for the slow-start algorithm.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Remove call->lock · 4e76bd40
      David Howells authored
      call->lock is no longer necessary, so remove it.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
    • rxrpc: Don't use a ring buffer for call Tx queue · a4ea4c47
      David Howells authored
      Change the way the Tx queueing works to make the following ends easier to
      achieve:
      
       (1) The filling of packets, the encryption of packets and the transmission
           of packets can be handled in parallel by separate threads, rather than
           rxrpc_sendmsg() allocating, filling, encrypting and transmitting each
           packet before moving onto the next one.
      
       (2) Get rid of the fixed-size ring which sets a hard limit on the number
           of packets that can be retained in the ring.  This allows the number
           of packets to increase without having to allocate a very large ring or
           having variable-sized rings.
      
           [Note: the downside of this is that it's then less efficient to locate
           a packet for retransmission as we then have to step through a list and
           examine each buffer in the list.]
      
       (3) Allow the filler/encrypter to run ahead of the transmission window.
      
       (4) Make it easier to do zero copy UDP from the packet buffers.
      
       (5) Make it easier to do zero copy from userspace to the packet buffers -
           and thence to UDP (only for unauthenticated connections).
      
      To that end, the following changes are made:
      
       (1) Use the new rxrpc_txbuf struct instead of sk_buff for keeping packets
           to be transmitted in.  This allows them to be placed on multiple
           queues simultaneously.  An sk_buff isn't really necessary as it's
           never passed on to lower-level networking code.
      
       (2) Keep the transmissible packets in a linked list on the call struct
           rather than in a ring.  As a consequence, the annotation buffer isn't
           used either; rather a flag is set on the packet to indicate ackedness.
      
       (3) Use the RXRPC_CALL_TX_LAST flag to indicate that the last packet to be
           transmitted has been queued.  Add RXRPC_CALL_TX_ALL_ACKED to indicate
           that all packets up to and including the last got hard acked.
      
       (4) Wire headers are now stored in the txbuf rather than being concocted
           on the stack and they're stored immediately before the data, thereby
           allowing zerocopy of a single span.
      
       (5) Don't bother with instant-resend on transmission failure; rather,
           leave it for a timer or an ACK packet to trigger.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
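
The trade-off noted in the commit above (a linked list has no fixed size limit, but finding a packet to retransmit means stepping through the list) can be sketched like this. struct txbuf here is a hypothetical stand-in with only the fields the sketch needs; it is not the real rxrpc_txbuf layout, and the queue helpers are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a queued Tx buffer; not the real rxrpc_txbuf. */
struct txbuf {
    struct txbuf *next;
    uint32_t seq;
    int acked;               /* flag on the packet instead of a side ring */
};

struct tx_queue {
    struct txbuf *head;
    struct txbuf *tail;
};

/* Append to the tail; the list can grow without a fixed-size ring. */
static void txq_append(struct tx_queue *q, struct txbuf *b)
{
    assert(q && b);
    b->next = NULL;
    if (q->tail)
        q->tail->next = b;
    else
        q->head = b;
    q->tail = b;
}

/* O(n) walk: the lookup cost accepted in exchange for dropping the ring. */
static struct txbuf *txq_first_unacked(struct tx_queue *q)
{
    struct txbuf *b;

    for (b = q->head; b; b = b->next)
        if (!b->acked)
            return b;
    return NULL;
}
```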
    • rxrpc: Get rid of the Rx ring · 5d7edbc9
      David Howells authored
      Get rid of the Rx ring and replace it with a pair of queues instead.  One
      queue gets the packets that are in-sequence and are ready for processing by
      recvmsg(); the other queue gets the out-of-sequence packets for addition to
      the first queue as the holes get filled.
      
      The annotation ring is removed and replaced with a SACK table.  The SACK
      table has the bits set that correspond exactly to the sequence number of
      the packet being acked.  The SACK ring is copied when an ACK packet is
      being assembled and rotated so that the first ACK is in byte 0.
      
      Flow control handling is altered so that packets that are moved to the
      in-sequence queue are hard-ACK'd even before they're consumed - and then
      the Rx window size in the ACK packet (rsize) is shrunk down to compensate
      (even going to 0 if the window is full).
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
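
The "copy and rotate" step described above, where a table indexed by sequence number is copied into an ACK so that the first ACK lands in byte 0, can be sketched as follows. This is a simplified model under stated assumptions: the table size of 8, the one-byte-per-entry layout and all names are hypothetical, whereas the commit message describes a table of bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SACK_SIZE 8u   /* assumed small size, for illustration only */

/* Hypothetical circular table: slot = seq % SACK_SIZE, 1 = received. */
struct sack_table {
    uint8_t slot[SACK_SIZE];
};

/* Record that the packet with this sequence number has arrived. */
static void sack_mark(struct sack_table *t, uint32_t seq)
{
    t->slot[seq % SACK_SIZE] = 1;
}

/*
 * Copy the table into an ACK payload, rotated so that out[0] describes
 * window_base, out[1] describes window_base + 1, and so on.
 */
static void sack_copy_rotated(const struct sack_table *t, uint32_t window_base,
                              uint8_t *out, size_t n)
{
    size_t i;

    assert(t && out && n <= SACK_SIZE);
    for (i = 0; i < n; i++)
        out[i] = t->slot[(window_base + i) % SACK_SIZE];
}
```

Indexing by the sequence number itself keeps marking cheap on receive; the rotation cost is paid only when an ACK is actually assembled.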