22 Apr, 2024 (7 commits)
    • net: add callback for setting a ubuf_info to skb · 65bada80
      Pavel Begunkov authored
      At the moment an skb can only have one ubuf_info associated with it,
      which might be a performance problem for zerocopy sends in cases like
      TCP via io_uring. Add a callback for assigning a ubuf_info to an skb;
      this way we can later implement smarter assignment, such as linking
      ubuf_info structures together.
      
      Note that the callback is optional and should be compatible with
      skb_zcopy_set(), because the net stack might decide to clone an skb
      and take another reference to the ubuf_info whenever it wishes. Also,
      a correct implementation must always be able to bind to an skb that
      has no prior ubuf_info; otherwise the send would not be able to
      progress.
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
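      A minimal user-space model of the optional hook this describes; the
      member and function names are assumptions drawn from the commit
      message, not the exact upstream definitions:
      
          struct sk_buff;                  /* opaque in this model */
          struct ubuf_info;
          
          struct ubuf_info_ops {
                  /* Optional: bind uarg to skb.  Must also accept an skb
                   * with no prior ubuf_info so a send can always progress. */
                  int (*link_skb)(struct sk_buff *skb, struct ubuf_info *uarg);
          };
          
          struct ubuf_info {
                  const struct ubuf_info_ops *ops;
                  int refcnt;              /* stand-in for the real refcount */
          };
          
          /* skb_zcopy_set()-compatible dispatch: prefer the provider's
           * hook, otherwise fall back to take-a-reference-and-attach. */
          static int model_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
          {
                  if (uarg->ops && uarg->ops->link_skb)
                          return uarg->ops->link_skb(skb, uarg);
          
                  uarg->refcnt++;          /* net_zcopy_get() in the stack */
                  /* ... attach uarg to the skb's shared info here ... */
                  return 0;
          }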
    • net: extend ubuf_info callback to ops structure · 7ab4f16f
      Pavel Begunkov authored
      We'll need to associate additional callbacks with ubuf_info, so
      introduce a structure holding the ubuf_info callbacks. Apart from the
      smarter io_uring notification management introduced in the next
      patches, it can be used to generalise msg_zerocopy_put_abort() and
      also to store ->sg_from_iter, which is currently passed in struct
      msghdr.
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
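      As a sketch of how such an ops grouping can generalise
      msg_zerocopy_put_abort(): once the completion callback lives in a
      per-ubuf ops table, an abort is just a completion reported without
      success, whichever zerocopy provider created the uarg. The types and
      names below are modelling assumptions, not the kernel's:
      
          #include <stdbool.h>
          #include <stddef.h>
          
          struct sk_buff;
          struct ubuf_info;
          
          struct ubuf_info_ops {
                  /* Completion callback, previously a bare function
                   * pointer carried by ubuf_info itself. */
                  void (*complete)(struct sk_buff *skb, struct ubuf_info *uarg,
                                   bool success);
                  /* ...room for more hooks, e.g. an sg_from_iter-style
                   * helper instead of passing it via struct msghdr... */
          };
          
          struct ubuf_info {
                  const struct ubuf_info_ops *ops;
          };
          
          /* Generalised abort: complete with success == false. */
          static void model_zcopy_put_abort(struct ubuf_info *uarg)
          {
                  uarg->ops->complete(NULL, uarg, false);
          }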
    • Merge branch 'tcp-avoid-sending-too-small-packets' · 65f1df11
      Jakub Kicinski authored
      Eric Dumazet says:
      
      ====================
      tcp: avoid sending too small packets
      
      tcp_sendmsg() cooks 'large' skbs, which are later split
      if needed by tcp_write_xmit().
      
      After a split, the leftover skb is smaller than the optimal
      size, and this causes a performance drop.
      
      In this series, a tcp_grow_skb() helper is added to shift
      payload from the second skb in the write queue to the first
      skb, so that we always send optimally sized skbs.
      
      This increases TSO efficiency and decreases the number of ACK
      packets.
      ====================
      
      Link: https://lore.kernel.org/r/20240418214600.1291486-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
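      A toy model of the payload-shifting idea behind tcp_grow_skb(); the
      names are invented and, unlike the real helper, only byte counts
      move here rather than page fragments:
      
          struct toy_skb {
                  unsigned int len;        /* queued payload bytes */
          };
          
          /* Top up the first write-queue buffer from the second until it
           * reaches the optimal send size. */
          static void toy_grow_skb(struct toy_skb *first,
                                   struct toy_skb *second,
                                   unsigned int optimal)
          {
                  unsigned int need, move;
          
                  need = optimal > first->len ? optimal - first->len : 0;
                  move = need < second->len ? need : second->len;
          
                  first->len  += move;
                  second->len -= move;
          }
      
      With the 180224-byte optimum from the next commit, a 172032-byte
      leftover would absorb 8192 bytes from the following skb and go out
      full-sized.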
    • tcp: try to send bigger TSO packets · 8ee602c6
      Eric Dumazet authored
      While investigating TCP performance, I found that TCP would
      sometimes send big skbs followed by a single-MSS skb,
      in a 'locked' pattern.
      
      For instance, BIG TCP is enabled and MSS is set to carry 4096 bytes
      of payload per segment, with gso_max_size set to 181000.
      
      This means that an optimal TCP packet should contain
      44 * 4096 = 180224 bytes of payload.
      
      However, I was seeing packet sizes interleaved in this pattern:
      
      172032, 8192, 172032, 8192, 172032, 8192, <repeat>
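      Note that each pair in the pattern adds up to exactly the optimal
      payload: 172032 = 42 * 4096 and 8192 = 2 * 4096, summing to 180224.
      A standalone check of these figures (a sketch, not kernel code):
      
          #include <stdio.h>
          
          int main(void)
          {
                  unsigned int mss = 4096;          /* payload per segment */
                  unsigned int gso_max_size = 181000;
                  unsigned int segs = gso_max_size / mss;      /* 44 */
          
                  printf("optimal payload: %u\n", segs * mss); /* 180224 */
                  printf("observed pair:   %u + %u = %u\n",
                         42 * mss, 2 * mss, 42 * mss + 2 * mss);
                  return 0;
          }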
      
      The tcp_tso_should_defer() heuristic is defeated because, after a
      split of a packet in the write queue for whatever reason (this might
      be a too-small CWND or a small enough pacing_rate), the leftover
      packet in the queue is smaller than the optimal size.
      
      It is time to make 'leftover packets' bigger so that
      tcp_tso_should_defer() can reach its full potential.
      
      After this patch, we can see the following output:
      
      14:13:34.009273 IP6 sender > receiver: Flags [P.], seq 4048380:4098360, ack 1, win 256, options [nop,nop,TS val 3425678144 ecr 1561784500], length 49980
      14:13:34.010272 IP6 sender > receiver: Flags [P.], seq 4098360:4148340, ack 1, win 256, options [nop,nop,TS val 3425678145 ecr 1561784501], length 49980
      14:13:34.011271 IP6 sender > receiver: Flags [P.], seq 4148340:4198320, ack 1, win 256, options [nop,nop,TS val 3425678146 ecr 1561784502], length 49980
      14:13:34.012271 IP6 sender > receiver: Flags [P.], seq 4198320:4248300, ack 1, win 256, options [nop,nop,TS val 3425678147 ecr 1561784503], length 49980
      14:13:34.013272 IP6 sender > receiver: Flags [P.], seq 4248300:4298280, ack 1, win 256, options [nop,nop,TS val 3425678148 ecr 1561784504], length 49980
      14:13:34.014271 IP6 sender > receiver: Flags [P.], seq 4298280:4348260, ack 1, win 256, options [nop,nop,TS val 3425678149 ecr 1561784505], length 49980
      14:13:34.015272 IP6 sender > receiver: Flags [P.], seq 4348260:4398240, ack 1, win 256, options [nop,nop,TS val 3425678150 ecr 1561784506], length 49980
      14:13:34.016270 IP6 sender > receiver: Flags [P.], seq 4398240:4448220, ack 1, win 256, options [nop,nop,TS val 3425678151 ecr 1561784507], length 49980
      14:13:34.017269 IP6 sender > receiver: Flags [P.], seq 4448220:4498200, ack 1, win 256, options [nop,nop,TS val 3425678152 ecr 1561784508], length 49980
      14:13:34.018276 IP6 sender > receiver: Flags [P.], seq 4498200:4548180, ack 1, win 256, options [nop,nop,TS val 3425678153 ecr 1561784509], length 49980
      14:13:34.019259 IP6 sender > receiver: Flags [P.], seq 4548180:4598160, ack 1, win 256, options [nop,nop,TS val 3425678154 ecr 1561784510], length 49980
      
      With 200 concurrent flows on a 100Gbit NIC, we can see a reduction
      in the number of TSO packets (and ACK packets) of about 30%.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240418214600.1291486-4-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit() · d5b38a71
      Eric Dumazet authored
      tcp_write_xmit() calls tcp_init_tso_segs()
      to set gso_size and gso_segs on the packet.
      
      tcp_init_tso_segs() requires the stack to maintain
      an up-to-date tcp_skb_pcount(), and this makes sense
      for packets in the rtx queue, but not so much for
      packets still in the write queue.
      
      In the following patch, we don't want to deal with
      tcp_skb_pcount() when moving payload from the 2nd
      skb to the 1st skb in the write queue.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240418214600.1291486-3-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
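      For reference, the segment count that tcp_set_skb_tso_segs()
      establishes is essentially the payload length divided by the MSS,
      rounded up; a minimal model of that derivation (not the kernel
      function itself):
      
          /* DIV_ROUND_UP(len, mss): how many MSS-sized segments the
           * skb's payload spans, i.e. its gso_segs / pcount. */
          static unsigned int toy_tso_segs(unsigned int len, unsigned int mss)
          {
                  return (len + mss - 1) / mss;
          }
      
      Deriving this at transmit time means the write queue no longer has
      to keep tcp_skb_pcount() current while payload is shifted between
      skbs.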
    • tcp: remove dubious FIN exception from tcp_cwnd_test() · 22555032
      Eric Dumazet authored
      tcp_cwnd_test() has special handling for the last packet in
      the write queue if it is smaller than one MSS and has the FIN flag
      set.
      
      This is in violation of the TCP RFC and seems quite dubious.
      
      This packet can be sent only if the current CWND is bigger
      than the number of packets in flight.
      
      Making the tcp_cwnd_test() result independent of the first skb
      in the write queue is needed for the last patch of the series.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240418214600.1291486-2-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
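      A model of the rule once the exception is removed; the real
      tcp_cwnd_test() returns a budget of segments rather than a boolean,
      so this captures only the go/no-go core, with invented names:
      
          /* Without the FIN exception, any packet may be sent only while
           * the number of packets in flight stays below CWND. */
          static int toy_cwnd_allows_send(unsigned int packets_in_flight,
                                          unsigned int cwnd)
          {
                  return packets_in_flight < cwnd;
          }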
    • Merge branch 'mlx5e-per-queue-coalescing' · f62a5e71
      Jakub Kicinski authored
      Tariq Toukan says:
      
      ====================
      mlx5e per-queue coalescing
      
      This patchset adds ethtool per-queue coalescing support for the mlx5e
      driver.
      
      The series introduces some changes needed as preparation for the
      final patch, which adds the support and implements the callbacks.
      Main changes:
      - DIM code movements into its own header file.
      - Switch to dynamic allocation of the DIM struct in the RQs/SQs.
      - Allow coalescing config changes without a channel reset when
        possible.
      ====================
      
      Link: https://lore.kernel.org/r/20240419080445.417574-1-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
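      Per-queue coalescing support plugs into the two existing ethtool_ops
      callbacks for this purpose, .get_per_queue_coalesce and
      .set_per_queue_coalesce; a sketch of the wiring, with the mlx5e
      handler names assumed rather than taken from the patch:
      
          /* Sketch only; building it requires kernel ethtool headers and
           * the driver's own handler implementations. */
          static const struct ethtool_ops mlx5e_ethtool_ops = {
                  /* ...existing callbacks... */
                  .get_per_queue_coalesce = mlx5e_get_per_queue_coalesce,
                  .set_per_queue_coalesce = mlx5e_set_per_queue_coalesce,
          };
      
      Once wired up, a single queue can be tuned from user space with,
      for example, ethtool --per-queue eth0 queue_mask 0x1 --coalesce
      rx-usecs 8.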