1. 16 Oct, 2021 33 commits
  2. 15 Oct, 2021 7 commits
    • Maciej Fijalkowski's avatar
      ice: make use of ice_for_each_* macros · 2faf63b6
      Maciej Fijalkowski authored
      Go through the code base and use ice_for_each_* macros.  While at it,
      introduce ice_for_each_xdp_txq() macro that can be used for looping over
      xdp_rings array.
      
      Commit is not introducing any new functionality.
      Signed-off-by: default avatarMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: default avatarGurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      2faf63b6
    • Maciej Fijalkowski's avatar
      ice: introduce XDP_TX fallback path · 22bf877e
      Maciej Fijalkowski authored
      Under rare circumstances there might be a situation where a requirement
      of having XDP Tx queue per CPU could not be fulfilled and some of the Tx
      resources have to be shared between CPUs. This yields a need for placing
      accesses to xdp_ring inside a critical section protected by spinlock.
      These accesses happen to be in the hot path, so let's introduce the
      static branch that will be triggered from the control plane when driver
      could not provide Tx queue dedicated for XDP on each CPU.
      
      Currently, the design that has been picked is to allow any number of XDP
      Tx queues that is at least half of a count of CPUs that platform has.
      For lower number driver will bail out with a response to user that there
      were not enough Tx resources that would allow configuring XDP. The
      sharing of rings is signalled via static branch enablement which in turn
      indicates that lock for xdp_ring accesses needs to be taken in hot path.
      
      Approach based on static branch has no impact on performance of a
      non-fallback path. One thing that is needed to be mentioned is a fact
      that the static branch will act as a global driver switch, meaning that
      if one PF got out of Tx resources, then other PFs that ice driver is
      servicing will suffer. However, given the fact that HW that ice driver
      is handling has 1024 Tx queues per each PF, this is currently an
      unlikely scenario.
      Signed-off-by: default avatarMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: default avatarGeorge Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      22bf877e
    • Maciej Fijalkowski's avatar
      ice: optimize XDP_TX workloads · 9610bd98
      Maciej Fijalkowski authored
      Optimize Tx descriptor cleaning for XDP. Current approach doesn't
      really scale and chokes when multiple flows are handled.
      
      Introduce two ring fields, @next_dd and @next_rs that will keep track of
      descriptor that should be looked at when the need for cleaning arise and
      the descriptor that should have the RS bit set, respectively.
      
      Note that at this point the threshold is a constant (32), but it is
      something that we could make configurable.
      
      First thing is to get away from setting RS bit on each descriptor. Let's
      do this only once NTU is higher than the currently @next_rs value. In
      such case, grab the tx_desc[next_rs], set the RS bit in descriptor and
      advance the @next_rs by a 32.
      
      Second thing is to clean the Tx ring only when there are less than 32
      free entries. For that case, look up the tx_desc[next_dd] for a DD bit.
      This bit is written back by HW to let the driver know that xmit was
      successful. It will happen only for those descriptors that had RS bit
      set. Clean only 32 descriptors and advance the DD bit.
      
      Actual cleaning routine is moved from ice_napi_poll() down to the
      ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any
      SKBs in there that would rely on interrupts for the cleaning. Nice side
      effect is that for rare case of Tx fallback path (that next patch is
      going to introduce) we don't have to trigger the SW irq to clean the
      ring.
      
      With those two concepts, ring is kept at being almost full, but it is
      guaranteed that driver will be able to produce Tx descriptors.
      
      This approach seems to work out well even though the Tx descriptors are
      produced in one-by-one manner. Test was conducted with the ice HW
      bombarded with packets from HW generator, configured to generate 30
      flows.
      
      Xdp2 sample yields the following results:
      <snip>
      proto 17:   79973066 pkt/s
      proto 17:   80018911 pkt/s
      proto 17:   80004654 pkt/s
      proto 17:   79992395 pkt/s
      proto 17:   79975162 pkt/s
      proto 17:   79955054 pkt/s
      proto 17:   79869168 pkt/s
      proto 17:   79823947 pkt/s
      proto 17:   79636971 pkt/s
      </snip>
      
      As that sample reports the Rx'ed frames, let's look at sar output.
      It says that what we Rx'ed we do actually Tx, no noticeable drops.
      Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s txcmp/s  rxmcst/s   %ifutil
      Average:       ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00      0.00      0.00     38.32
      
      with tx_busy staying calm.
      
      When compared to a state before:
      Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s txcmp/s  rxmcst/s   %ifutil
      Average:       ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00      0.00      0.00     43.64
      
      it can be observed that the amount of txpck/s is almost doubled, meaning
      that the performance is improved by around 90%. All of this due to the
      drops in the driver, previously the tx_busy stat was bumped at a 7mpps
      rate.
      Signed-off-by: default avatarMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: default avatarGeorge Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      9610bd98
    • Maciej Fijalkowski's avatar
      ice: propagate xdp_ring onto rx_ring · eb087cd8
      Maciej Fijalkowski authored
      With rings being split, it is now convenient to introduce a pointer to
      XDP ring within the Rx ring. For XDP_TX workloads this means that
      xdp_rings array access will be skipped, which was executed per each
      processed frame.
      
      Also, read the XDP prog once per NAPI and if prog is present, set up the
      local xdp_ring pointer. Reading prog a single time was discussed in [1]
      with some concern raised by Toke around dispatcher handling and having
      the need for going through the RCU grace period in the ndo_bpf driver
      callback, but ice currently is torning down NAPI instances regardless of
      the prog presence on VSI.
      
      Although the pointer to XDP ring introduced to Rx ring makes things a
      lot slimmer/simpler, I still feel that single prog read per NAPI
      lifetime is beneficial.
      
      Further patch that will introduce the fallback path will also get a
      profit from that as xdp_ring pointer will be set during the XDP rings
      setup.
      
      [1]: https://lore.kernel.org/bpf/87k0oseo6e.fsf@toke.dk/Signed-off-by: default avatarMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: default avatarGeorge Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      eb087cd8
    • Maciej Fijalkowski's avatar
      ice: do not create xdp_frame on XDP_TX · a55e16fa
      Maciej Fijalkowski authored
      xdp_frame is not needed for XDP_TX data path in ice driver case.
      For this data path cleaning of sent descriptor will not happen anywhere
      outside of the driver, which means that carrying the information about
      the underlying memory model via xdp_frame will not be used. Therefore,
      this conversion can be simply dropped, which would relieve CPU a bit.
      Signed-off-by: default avatarMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: default avatarGeorge Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      a55e16fa
    • Maciej Fijalkowski's avatar
      ice: unify xdp_rings accesses · 0bb4f9ec
      Maciej Fijalkowski authored
      There has been a long lasting issue of improper xdp_rings indexing for
      XDP_TX and XDP_REDIRECT actions. Given that currently rx_ring->q_index
      is mixed with smp_processor_id(), there could be a situation where Tx
      descriptors are produced onto XDP Tx ring, but tail is never bumped -
      for example pin a particular queue id to non-matching IRQ line.
      
      Address this problem by ignoring the user ring count setting and always
      initialize the xdp_rings array to be of num_possible_cpus() size. Then,
      always use the smp_processor_id() as an index to xdp_rings array. This
      provides serialization as at given time only a single softirq can run on
      a particular CPU.
      Signed-off-by: default avatarMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: default avatarGeorge Kuruvinakunnel <george.kuruvinakunnel@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      0bb4f9ec
    • Maciej Fijalkowski's avatar
      ice: split ice_ring onto Tx/Rx separate structs · e72bba21
      Maciej Fijalkowski authored
      While it was convenient to have a generic ring structure that served
      both Tx and Rx sides, next commits are going to introduce several
      Tx-specific fields, so in order to avoid hurting the Rx side, let's
      pull out the Tx ring onto new ice_tx_ring and ice_rx_ring structs.
      
      Rx ring could be handled by the old ice_ring which would reduce the code
      churn within this patch, but this would make things asymmetric.
      
      Make the union out of the ring container within ice_q_vector so that it
      is possible to iterate over newly introduced ice_tx_ring.
      
      Remove the @size as it's only accessed from control path and it can be
      calculated pretty easily.
      
      Change definitions of ice_update_ring_stats and
      ice_fetch_u64_stats_per_ring so that they are ring agnostic and can be
      used for both Rx and Tx rings.
      
      Sizes of Rx and Tx ring structs are 256 and 192 bytes, respectively. In
      Rx ring xdp_rxq_info occupies its own cacheline, so it's the major
      difference now.
      Signed-off-by: default avatarMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: default avatarGurucharan G <gurucharanx.g@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      e72bba21