    Merge tag 'rxrpc-next-20221108' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 3ca6c3b4
    David S. Miller authored
    rxrpc changes
    
    David Howells says:
    
    ====================
    rxrpc: Increasing SACK size and moving away from softirq, part 1
    
    AF_RXRPC has some issues that need addressing:
    
     (1) The SACK table has a maximum capacity of 255, but for modern networks
         that isn't sufficient.  This is hard to increase in the upstream code
         because of the way the application thread is coupled to the softirq
         and retransmission side through a ring buffer.  Adjustments to the rx
         protocol allow a capacity of up to 8192, and having a ring
         sufficiently large to accommodate that would use an excessive amount
         of memory as this is per-call.
    
     (2) Processing ACKs in softirq mode causes the ACKs to get conflated, with
         only the most recent being considered.  Whilst this has the upside
         that the retransmission algorithm only needs to deal with the most
         recent ACK, it causes DATA transmission for a call to be very bursty
         because DATA packets cannot be transmitted in softirq mode.  Rather,
         transmission must be delegated to either the application thread or a
         workqueue, so there tend to be sudden bursts of traffic for any
         particular call due to scheduling delays.
    
     (3) All crypto in a single call is done in series; however, each DATA
         packet is individually encrypted so encryption and decryption of large
         calls could be parallelised if spare CPU resources are available.
    
    This is the first of a number of sets of patches that try and address them.
    The overall aims of these changes include:
    
     (1) To get rid of the TxRx ring and instead pass the packets round in
         queues (eg. sk_buff_head).  On the Tx side, each ACK packet comes with
         a SACK table that can be parsed as-is, so there's no particular need
         to maintain our own; we just have to refer to the ACK.
    
         On the Rx side, we do need to maintain a SACK table with one bit per
         entry - but only if packets go missing - and we don't want to have to
         perform a complex transformation to get the information into an ACK
         packet.
    
     (2) To try and move almost all processing of received packets out of the
         softirq handler and into a high-priority kernel I/O thread.  Only the
         transferral of packets would be left there.  I would still use the
         encap_rcv hook to receive packets as there's a noticeable performance
         drop from letting the UDP socket put the packets into its own queue
         and then getting them out of there.
    
     (3) To make the I/O thread also do all the transmission.  The app thread
         would be responsible for packaging the data into packets and then
         buffering them for the I/O thread to transmit.  This would make it
         easier for the app thread to run ahead of the I/O thread, and would
         mean the I/O thread is less likely to have to wait around for a new
         packet to become available for transmission.
    
     (4) To logically partition the socket/UAPI/KAPI side of things from the
         I/O side of things.  The local endpoint, connection, peer and call
         objects would belong to the I/O side.  The socket side would not then
         touch the private internals of calls and suchlike and would not change
         their states.  It would only look at the send queue, receive queue and
         a way to pass a message to cause an abort.
    
     (5) To remove as much locking, synchronisation, barriering and atomic ops
         as possible from the I/O side.  Exclusion would be achieved by
         limiting modification of state to the I/O thread only.  Locks would
         still need to be used in communication with the UDP socket and the
         AF_RXRPC socket API.
    
     (6) To provide crypto offload kernel threads that, when there's slack in
         the system, can see packets that need crypting and provide
         parallelisation in dealing with them.
    
     (7) To remove the use of system timers.  Since each timer would then send
         a poke to the I/O thread, which would then deal with it when it had
         the opportunity, there seems no point in using system timers if,
         instead, a list of timeouts can be sensibly consulted.  An I/O thread
         only then needs to schedule with a timeout when it is idle.
    
     (8) To use zero-copy sendmsg to send packets.  This would make use of the
         I/O thread being the sole transmitter on the socket to manage the
         dead-reckoning sequencing of the completion notifications.  There is a
         problem with zero-copy, though: the UDP socket doesn't handle running
         out of option memory very gracefully.
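
    As an illustration of aim (1), the Rx-side SACK table with one bit per
    entry could be sketched roughly as below.  This is a userspace sketch
    only, not the code in the patches: the names rx_sack_table, sack_init(),
    sack_mark(), sack_clear() and sack_test() are hypothetical, and indexing
    by sequence number modulo the capacity is an assumption.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define SACK_CAPACITY  8192u	/* capacity the adjusted protocol allows */
#define SACK_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define SACK_WORDS     (SACK_CAPACITY / SACK_WORD_BITS)

/* Hypothetical per-call Rx SACK table: one bit per sequence slot. */
struct rx_sack_table {
	unsigned long bits[SACK_WORDS];
};

/* Start with every slot marked as not received. */
static void sack_init(struct rx_sack_table *t)
{
	memset(t, 0, sizeof(*t));
}

/* Record that the DATA packet with sequence number seq has arrived. */
static void sack_mark(struct rx_sack_table *t, unsigned int seq)
{
	unsigned int slot = seq % SACK_CAPACITY;

	t->bits[slot / SACK_WORD_BITS] |= 1UL << (slot % SACK_WORD_BITS);
}

/* Release a slot once the receive window has moved past it. */
static void sack_clear(struct rx_sack_table *t, unsigned int seq)
{
	unsigned int slot = seq % SACK_CAPACITY;

	t->bits[slot / SACK_WORD_BITS] &= ~(1UL << (slot % SACK_WORD_BITS));
}

/* Report whether the packet with sequence number seq has been seen. */
static bool sack_test(const struct rx_sack_table *t, unsigned int seq)
{
	unsigned int slot = seq % SACK_CAPACITY;

	return t->bits[slot / SACK_WORD_BITS] &
	       (1UL << (slot % SACK_WORD_BITS));
}
```

    With 8-bit bytes the table above occupies 8192 / 8 = 1024 bytes per
    call, whereas a ring of 8192 pointers would need 64KiB per call on a
    64-bit machine.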
    
    With regard to this first patchset, the changes made include:
    
     (1) Some fixes, including a fallback for proc_create_net_single_write(),
         setting ack.bufferSize to 0 in ACK packets and a fix for rxrpc
         congestion management, which shouldn't be saving the cwnd value
         between calls.
    
     (2) Improvements in rxrpc tracepoints, including splitting the timer
         tracepoint into a set-timer and a timer-expired trace.
    
     (3) Addition of a new proc file to display some stats.
    
     (4) Some code cleanups, including removing some unused bits and
         unnecessary header inclusions.
    
     (5) A change to the recently added UDP encap_err_rcv hook so that it has
         the same signature as {ip,ipv6}_icmp_error(), and then just have rxrpc
         point its UDP socket's hook directly at those.
    
     (6) Definition of a new struct, rxrpc_txbuf, that is used to hold
         transmissible packets of DATA and ACK type in a single 2KiB block
         rather than using an sk_buff.  This allows the buffer to be on a
         number of queues simultaneously more easily, and also guarantees that
         the entire block is in a single unit for zerocopy purposes and that
         the data payload is aligned for in-place crypto purposes.
    
     (7) ACK txbufs are allocated at proposal and queued for later transmission
         rather than being stored in a single place in the rxrpc_call struct,
         which meant that only a single ACK could be pending transmission at a
         time.  The queue is then drained at various points.  This allows the
         ACK generation code to be simplified.
    
     (8) The Rx ring buffer is removed.  When a jumbo packet is received (which
         comprises a number of ordinary DATA packets glued together), it used
         to be pointed to by the ring multiple times, with an annotation in a
         side ring indicating which subpacket was in that slot - but this is no
         longer possible.  Instead, the packet is cloned once for each
         subpacket, barring the last, and the range of data is set in the skb
         private area.  This makes it easier for the subpackets in a jumbo
         packet to be decrypted in parallel.
    
     (9) The Tx ring buffer is removed.  The side annotation ring that held the
         SACK information is also removed.  Instead, in the event of packet
         loss, the SACK data attached to an ACK packet is parsed.
    
    (10) Allocate an skcipher request when needed in the rxkad security class
         rather than caching one in the rxrpc_call struct.  This deals with a
         race between externally-driven call disconnection getting rid of the
         skcipher request and sendmsg/recvmsg trying to use it because they
         haven't seen the completion yet.  This is also needed to support
         parallelisation as the skcipher request cannot be used by two or more
         threads simultaneously.
    
    (11) Call udp_sendmsg() and udpv6_sendmsg() directly rather than going
         through kernel_sendmsg() so that we can provide our own iterator
         (zerocopy explicitly doesn't work with a KVEC iterator).  This also
         lets us avoid the overhead of the security hook.
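
    To illustrate change (8), the per-subpacket data ranges for a jumbo
    packet might be computed along the lines below.  This is a userspace
    sketch under stated assumptions, not the code in the patches:
    subpkt_range and split_jumbo() are hypothetical names, the sizes (1412
    bytes of data per non-final subpacket, followed by a 4-byte header) are
    assumed rather than taken from this description, and the split is
    decided by length alone, which is a simplification.

```c
#include <stddef.h>

#define JUMBO_DATALEN	1412u	/* assumed data size of a non-final subpacket */
#define JUMBO_HDRLEN	4u	/* assumed size of the header between subpackets */
#define JUMBO_SUBPKTLEN	(JUMBO_DATALEN + JUMBO_HDRLEN)

/* Hypothetical stand-in for the range recorded in the skb private area. */
struct subpkt_range {
	unsigned int offset;	/* start of the subpacket's data */
	unsigned int len;	/* length of the subpacket's data */
};

/*
 * Split a jumbo payload of total_len bytes into per-subpacket ranges.
 * Returns the number of ranges written, or 0 if out[] is too small.
 */
static size_t split_jumbo(unsigned int total_len,
			  struct subpkt_range *out, size_t max)
{
	unsigned int offset = 0;
	size_t n = 0;

	/* Every subpacket but the last is full-sized and ends in a header. */
	while (total_len - offset > JUMBO_SUBPKTLEN) {
		if (n == max)
			return 0;
		out[n].offset = offset;
		out[n].len = JUMBO_DATALEN;
		n++;
		offset += JUMBO_SUBPKTLEN;
	}

	/* The last subpacket takes whatever data remains. */
	if (n == max)
		return 0;
	out[n].offset = offset;
	out[n].len = total_len - offset;
	n++;
	return n;
}
```

    Each range could then be attached to its own clone of the received
    packet, which is what lets the subpackets be handed out separately.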
    ====================
    Signed-off-by: David S. Miller <davem@davemloft.net>