27 Sep, 2021 13 commits
    • Merge branch 'bpf-xsk-rx-batch' · 4c9f0937
      Daniel Borkmann authored
      Magnus Karlsson says:
      
      ====================
      This patch set introduces a batched interface for Rx buffer allocation
      in the AF_XDP buffer pool. Instead of using xsk_buff_alloc(*pool),
      drivers can now use xsk_buff_alloc_batch(*pool, **xdp_buff_array,
      max). Instead of returning a pointer to a single xdp_buff, it returns
      the number of xdp_buffs it managed to allocate, up to the maximum
      given by the max parameter. Pointers to the allocated xdp_buffs are
      put in the xdp_buff_array supplied in the call. This can be a SW ring
      that already exists in the driver or a new structure that the driver
      has allocated.
      
        u32 xsk_buff_alloc_batch(struct xsk_buff_pool *pool,
                                 struct xdp_buff **xdp,
                                 u32 max);
      
      When using this interface, the driver should also use the new helper
      below to set the relevant fields in the struct xdp_buff. This is
      needed because xsk_buff_alloc_batch() does not fill in the data and
      data_meta fields for you, as xsk_buff_alloc() does, so it is no
      longer sufficient for the driver to just set data_end (effectively
      the size). The reason for this split is performance, as explained in
      detail in the commit message. A rough sketch of typical driver usage
      follows the signature below.
      
        void xsk_buff_set_size(struct xdp_buff *xdp, u32 size);
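
      As a rough sketch only (not code from these patches; struct
      my_rx_ring and fill_hw_desc() are hypothetical driver-side names),
      the two calls would typically be used together like this:

        struct my_rx_ring {
                struct xsk_buff_pool *xsk_pool;
                struct xdp_buff **xdp_buf;   /* flat array of xdp_buff pointers */
                u32 next_to_use;
        };

        /* Refill path: allocate up to count buffers in one call and program
         * one HW descriptor per returned buffer. fill_hw_desc() is made up. */
        static u32 my_alloc_rx_buffers_zc(struct my_rx_ring *ring, u32 count)
        {
                struct xdp_buff **xdp = &ring->xdp_buf[ring->next_to_use];
                u32 nb_buffs, i;

                nb_buffs = xsk_buff_alloc_batch(ring->xsk_pool, xdp, count);
                for (i = 0; i < nb_buffs; i++)
                        fill_hw_desc(ring, ring->next_to_use + i,
                                     xsk_buff_xdp_get_dma(xdp[i]));

                ring->next_to_use += nb_buffs;
                return nb_buffs;
        }

        /* Rx irq path: once the packet length is known, set data, data_meta
         * and data_end in one go via the new helper. */
        static void my_receive_zc(struct my_rx_ring *ring, struct xdp_buff *xdp,
                                  u32 size)
        {
                xsk_buff_set_size(xdp, size);
                xsk_buff_dma_sync_for_cpu(xdp, ring->xsk_pool);
        }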
      
      Patch 6 also optimizes the buffer allocation in the aligned case. In
      this case, we can skip the reinitialization of most fields in the
      xdp_buff_xsk struct at allocation time. As the number of elements in
      the heads array is equal to the number of possible buffers in the
      umem, we can initialize them once and for all at bind time and then
      just point to the correct one in the xdp_buff_array that is returned
      to the driver. No reason to have a stack of free head entries. In the
      unaligned case, the buffers can reside anywhere in the umem, so this
      optimization is not possible as we still have to fill in the right
      information in the xdp_buff every single time one is allocated.
      
      I have updated i40e and ice to use this new batched interface.
      
      These are the throughput results on my 2.1 GHz Cascade Lake system:
      
      Aligned mode:
      ice: +11% / -9 cycles/pkt
      i40e: +12% / -9 cycles/pkt
      
      Unaligned mode:
      ice: +1.5% / -1 cycle/pkt
      i40e: +1% / -1 cycle/pkt
      
      For the aligned case, batching provides around 40% of the performance
      improvement and the aligned-path optimization the remaining ~60%.
      Based on this, I would have expected a ~4% boost for unaligned mode,
      but I only get around 1% and do not know why. Note that memory
      consumption in aligned mode is also reduced by this patch set.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests: xsk: Add frame_headroom test · e34087fc
      Magnus Karlsson authored
      Add a test for the frame_headroom feature that can be set on the
      umem. The added logic validates that the offsets of all packets in
      all tests are valid, not just in the tests that have a specifically
      configured frame_headroom.
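
      As an illustration only (not the actual selftest code; the function
      and parameter names here are made up), the per-packet check in
      aligned mode boils down to something like this:

        /* In aligned mode, every received descriptor address is expected to
         * start frame_headroom + XDP_PACKET_HEADROOM bytes into its frame,
         * whether or not the test configured a non-zero frame_headroom. */
        static bool my_offset_is_valid(u64 addr, u32 frame_size, u32 frame_headroom)
        {
                u32 offset = addr % frame_size;

                return offset == frame_headroom + XDP_PACKET_HEADROOM;
        }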
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-14-magnus.karlsson@gmail.com
    • selftests: xsk: Change interleaving of packets in unaligned mode · e4e9baf0
      Magnus Karlsson authored
      Change the interleaving of packets in unaligned mode. With the current
      buffer addresses in the packet stream, the last buffer in the umem
      could not be used, as a large packet could potentially write past the
      end of the umem. The kernel correctly threw this buffer address away
      and refused to use it. This is perfectly fine for all regular packet
      streams, but the streams used for unaligned mode place every other
      packet at a different offset. As checks for correct offsets are added
      in the next patch, this needs to be fixed. Start these
      page-boundary-straddling buffers one page earlier so that the last
      one does not land on the last page of the umem, making all buffers
      valid.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-13-magnus.karlsson@gmail.com
    • selftests: xsk: Add single packet test · 96a40678
      Magnus Karlsson authored
      Add a test where a single packet is sent and received. This might
      sound like a trivial test, but since many of the interfaces in xsk
      are batched, it is important to validate that something as
      fundamental as receiving a single packet, rather than batches of
      packets at high speed, has not been broken.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-12-magnus.karlsson@gmail.com
    • selftests: xsk: Introduce pacing of traffic · 1bf36496
      Magnus Karlsson authored
      Introduce pacing of traffic so that the Tx thread can never send more
      packets than the receiver has processed plus the number of packets it
      can hold in its umem. At any point in time, the number of in-flight
      packets (not yet processed by the Rx thread) is therefore less than
      or equal to the number of packets that can be held in the Rx thread's
      umem. A sketch of this invariant is shown below.
      
      The batch size is also increased to improve running time.
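
      A minimal sketch of the pacing check (hypothetical names, not the
      actual test code): the Tx thread tracks how many sent packets the Rx
      thread has not yet processed and backs off before exceeding the
      capacity of the receiver's umem:

        /* Never have more packets in flight than the Rx umem can hold. */
        static void my_wait_for_room(struct my_test_spec *test, u32 batch_size)
        {
                while (test->outstanding_tx + batch_size > test->rx_umem_num_frames)
                        usleep(100); /* arbitrary back-off */
        }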
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-11-magnus.karlsson@gmail.com
    • selftests: xsk: Fix socket creation retry · 89013b8a
      Magnus Karlsson authored
      The socket creation retry unnecessarily registered the umem once for
      every retry. There is no reason to do this: it wastes memory and
      might lead to too many pages being locked at some point, causing a
      test to fail.
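
      Conceptually (hypothetical names, not the actual diff), the umem
      registration moves out of the retry loop so that only the socket
      setup itself is retried:

        /* Register the umem once, outside the loop ... */
        umem = my_configure_umem(ifobject);

        /* ... and only retry the socket creation itself. */
        for (i = 0; i < MY_MAX_RETRIES; i++) {
                if (!my_configure_socket(ifobject, umem))
                        break;
                usleep(1000); /* back off before retrying */
        }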
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-10-magnus.karlsson@gmail.com
    • selftests: xsk: Put the same buffer only once in the fill ring · 872a1184
      Magnus Karlsson authored
      Fix a problem where the fill ring was populated with too many
      entries. If the number of buffers in the umem was smaller than the
      fill ring size, the code would wrap around to the beginning of the
      umem and start putting the same buffers in again. This is racy, as a
      later packet can be received and overwrite an earlier one before the
      Rx thread manages to validate it.
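
      A minimal sketch of the fix (variable names are illustrative, and the
      reserve call is assumed to succeed for brevity): cap the number of
      fill ring entries at the number of distinct buffers instead of
      wrapping around the umem:

        /* Populate each umem buffer at most once. umem->fq is the fill ring
         * of the test's umem wrapper. */
        static void my_populate_fill_ring(struct xsk_umem_info *umem, u32 num_frames,
                                          u32 fill_size, u32 frame_size)
        {
                u32 nb = num_frames < fill_size ? num_frames : fill_size;
                u32 idx = 0, i;

                xsk_ring_prod__reserve(&umem->fq, nb, &idx);
                for (i = 0; i < nb; i++)
                        *xsk_ring_prod__fill_addr(&umem->fq, idx++) = i * frame_size;
                xsk_ring_prod__submit(&umem->fq, nb);
        }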
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-9-magnus.karlsson@gmail.com
    • selftests: xsk: Fix missing initialization · 5b132056
      Magnus Karlsson authored
      Fix the missing initialization of the rx_pkt_nb member in the packet
      stream. This led to some tests declaring success too early, as they
      thought all packets had already been received.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-8-magnus.karlsson@gmail.com
    • xsk: Optimize for aligned case · 94033cd8
      Magnus Karlsson authored
      Optimize for the aligned case by precomputing the parameter values of
      the xdp_buff_xsk and xdp_buff structures in the heads array. We can
      do this because, in the aligned case, the size of the heads array is
      equal to the number of chunks in the umem. Every entry in this array
      then corresponds to a fixed chunk/frame, can be prepopulated with the
      correct values, and the free_heads stack can be dropped. Note that in
      the aligned case it is not possible to allocate more buffers than
      there are chunks, since each chunk can only contain a single buffer.

      Unfortunately, we cannot do this in the unaligned case, as one chunk
      might contain multiple buffers. There, we keep the old scheme of
      populating a heads entry every time it is used and using the
      free_heads stack.
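
      A simplified sketch of the aligned-case setup (field and helper names
      are approximate, not the exact kernel code): at bind time, entry i of
      the heads array is made to permanently describe chunk i of the umem,
      so the allocation fast path only has to hand out pointers:

        for (i = 0; i < pool->free_heads_cnt; i++) {
                struct xdp_buff_xsk *xskb = &pool->heads[i];

                xskb->pool = pool;
                if (pool->unaligned)
                        pool->free_heads[i] = xskb;    /* old scheme kept */
                else
                        /* Precompute orig_addr and data_hard_start once. */
                        xp_init_xskb_addr(xskb, pool, i * pool->chunk_size);
        }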
      
      Also move xp_release() and xp_get_handle() to xsk_buff_pool.h. They
      were for some reason in xsk.c even though they are buffer pool
      operations.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-7-magnus.karlsson@gmail.com
    • i40e: Use the xsk batched rx allocation interface · 6aab0bb0
      Magnus Karlsson authored
      Use the new xsk batched rx allocation interface for the zero-copy data
      path. As the array of struct xdp_buff pointers kept by the driver is
      really a ring that wraps, the allocation routine is modified to detect
      a wrap and in that case call the allocation function twice. The
      allocation function cannot deal with wrapped rings, only arrays. As we
      now know exactly how many buffers we get and that there is no
      wrapping, the allocation function can be simplified even more as all
      if-statements in the allocation loop can be removed, improving
      performance.
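
      Schematically (simplified, with hypothetical helper names; the real
      driver code differs in detail), the wrap handling looks like this:

        /* The xdp_buff pointer array is used as a ring, so split the request
         * at the wrap point and allocate in two flat-array chunks, since
         * xsk_buff_alloc_batch() only understands flat arrays. my_fill_from()
         * stands in for a helper that calls xsk_buff_alloc_batch() and
         * programs one HW descriptor per returned buffer. */
        static bool my_alloc_rx_buffers_zc(struct my_rx_ring *ring, u16 count)
        {
                u16 ntu = ring->next_to_use;
                u32 nb_buffs;

                if (ntu + count <= ring->count) {
                        nb_buffs = my_fill_from(ring, ntu, count);
                } else {
                        /* Wrap: fill to the end of the ring, then from index 0. */
                        nb_buffs = my_fill_from(ring, ntu, ring->count - ntu);
                        if (nb_buffs == ring->count - ntu)
                                nb_buffs += my_fill_from(ring, 0, count - nb_buffs);
                }
                return nb_buffs == count;
        }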
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-6-magnus.karlsson@gmail.com
    • ice: Use the xsk batched rx allocation interface · db804cfc
      Magnus Karlsson authored
      Use the new xsk batched rx allocation interface for the zero-copy data
      path. As the array of struct xdp_buff pointers kept by the driver is
      really a ring that wraps, the allocation routine is modified to detect
      a wrap and in that case call the allocation function twice. The
      allocation function cannot deal with wrapped rings, only arrays. As we
      now know exactly how many buffers we get and that there is no
      wrapping, the allocation function can be simplified even more as all
      if-statements in the allocation loop can be removed, improving
      performance.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-5-magnus.karlsson@gmail.com
    • ice: Use xdp_buf instead of rx_buf for xsk zero-copy · 57f7f8b6
      Magnus Karlsson authored
      In order to use the new xsk batched buffer allocation interface, a
      pointer to an array of struct xdp_buff pointers needs to be provided
      so that the function can put the result of the allocation there. In
      the ice driver, we already have a ring that stores pointers to
      xdp_buffs. It is only used by the xsk zero-copy path and is a union
      with the structure used by the regular non-zero-copy path.
      Unfortunately, that structure is larger than an xdp_buff pointer,
      which means that there is a stride (of 20 bytes) between consecutive
      xdp_buff pointers. Feeding this into the xsk_buff_alloc_batch
      interface will not work, since it assumes a regular array of xdp_buff
      pointers (each 8 bytes wide with 0 bytes in between them on a 64-bit
      system).
      
      To fix this, remove the xdp_buff pointer from the rx_buf union and
      move it one step higher, into the enclosing union that only contains
      pointers to arrays. This solves the problem: in the next patch, when
      the batched interface is used, the SW ring of xdp_buff pointers can
      be fed straight into the allocation function, which improves
      performance.
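
      A simplified illustration of the layout problem (the types and field
      names here are made up, not the actual ice structures):

        /* Before: the xdp_buff pointer sat in a union with the larger rx_buf
         * structure, so an array of these entries has a stride of
         * sizeof(union), not sizeof(struct xdp_buff *). */
        union my_rx_buf {
                struct {
                        struct page *page;
                        unsigned int page_offset;
                        u16 pagecnt_bias;
                } buf;                       /* regular path, larger than 8 bytes */
                struct xdp_buff *xdp;        /* zero-copy path */
        };

        /* After: the zero-copy path gets its own tightly packed array of
         * xdp_buff pointers at the ring level, which is exactly the layout
         * xsk_buff_alloc_batch() expects. */
        struct my_rx_ring {
                union {
                        union my_rx_buf *rx_buf;     /* regular path */
                        struct xdp_buff **xdp_buf;   /* zero-copy path */
                };
        };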
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-4-magnus.karlsson@gmail.com
    • xsk: Batched buffer allocation for the pool · 47e4075d
      Magnus Karlsson authored
      Add a new driver interface, xsk_buff_alloc_batch(), offering batched
      buffer allocations to improve performance. The new interface takes
      three arguments: the buffer pool to allocate from, a pointer to an
      array of struct xdp_buff pointers which will contain pointers to the
      allocated xdp_buffs, and an unsigned integer specifying the max
      number of buffers to allocate. The return value is the actual number
      of buffers that the allocator managed to allocate, and it will be in
      the range 0 <= N <= max, where max is the third parameter to the
      function.
      
      u32 xsk_buff_alloc_batch(struct xsk_buff_pool *pool, struct xdp_buff **xdp,
                               u32 max);
      
      A second driver interface is also introduced that needs to be used in
      conjunction with xsk_buff_alloc_batch(). It is a helper that sets the
      size of the struct xdp_buff and is used by the NIC Rx irq routine
      when receiving a packet. This helper sets the three struct members
      data, data_meta, and data_end. In the xsk_buff_alloc() case, the
      first two are set in the allocation routine and data_end is set when
      a packet is received in the receive irq function. That unfortunately
      leads to worse performance, since the xdp_buff is touched twice with
      a long time period in between, leading to an extra cache miss.
      Instead, we fill out the xdp_buff with all three fields at a single
      point in time in the driver, when the size of the packet is known.
      Hence this helper. Note that the driver has to use this helper (or
      set all three fields itself) when using xsk_buff_alloc_batch();
      xsk_buff_alloc() works as before and does not require it.
      
      void xsk_buff_set_size(struct xdp_buff *xdp, u32 size);
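
      The helper itself is tiny; roughly (a sketch of the idea: all three
      pointers are derived from data_hard_start once the packet length is
      known):

        static inline void xsk_buff_set_size(struct xdp_buff *xdp, u32 size)
        {
                xdp->data = xdp->data_hard_start + XDP_PACKET_HEADROOM;
                xdp->data_meta = xdp->data;
                xdp->data_end = xdp->data + size;
        }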
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-3-magnus.karlsson@gmail.com