1. 10 Aug, 2024 10 commits
    • Nick Child's avatar
      ibmvnic: Perform tx CSO during send scrq direct · e633e32b
      Nick Child authored
      During initialization with the vnic server, a bitstring is communicated
      to the client regarding header info needed during CSO (See "VNIC
      Capabilities" in PAPR). Most of the time, to be safe, vnic server
      requests header info for CSO. When header info is needed, multiple TX
      descriptors are required per skb; This limits the driver to use
      send_subcrq_indirect instead of send_subcrq_direct.
      
      Previously, the vnic server request for header info was ignored. This
      allowed the use of send_sub_crq_direct. Transmissions were successful
      because the bitstring returned by vnic server is broad and over
      cautionary. It was observed that mlx backing devices could actually
      transmit and handle CSO packets without the vnic server receiving
      header info (despite the fact that the bitstring requested it).
      
      There was a trust issue: The bitstring was overcautionary. This extra
      precaution (requesting header info when the backing device may not use
      it) comes at the cost of performance (using direct vs indirect hcalls
      has a 30% delta in small packet RR transaction rate). So it has been
      requested that the vnic server team tries to ensure that the bitstring
      is more exact. In the meantime, disable CSO when it is possible to use
      the skb in the send_subcrq_direct path. In other words, calculate the
      checksum before handing the packet to FW when the packet is not
      segmented and xmit_more is false.
      
      Since the code path is only possible if the skb is non GSO and xmit_more
      is false, the cost of doing checksum in the send_subcrq_direct path is
      minimal. Any large segmented skb will have xmit_more set to true more
      frequently and it is inexpensive to do checksumming on a small skb.
      The worst-case workload would be a 9000 MTU TCP_RR test with close
      to MTU sized packets (and TSO off). This allows xmit_more to be false
      more frequently and open the code path up to use send_subcrq_direct.
      Observing trace data (graph-time = 1) and packet rate with this workload
      shows minimal performance degradation:
      
      1. NIC does checksum w headers, safely use send_subcrq_indirect:
        - Packet rate: 631k txs
        - Trace data:
          ibmvnic_xmit = 44344685.87 us / 6234576 hits = AVG 7.11 us
            skb_checksum_help = 4.07 us / 2 hits = AVG 2.04 us
             ^ Notice hits, tracing this just for reassurance
            ibmvnic_tx_scrq_flush = 33040649.69 us / 5638441 hits = AVG 5.86 us
              send_subcrq_indirect = 37438922.24 us / 6030859 hits = AVG 6.21 us
      
      2. NIC does checksum w/o headers, dangerously use send_subcrq_direct:
        - Packet rate: 831k txs
        - Trace data:
          ibmvnic_xmit = 48940092.29 us / 8187630 hits = AVG 5.98 us
            skb_checksum_help = 2.03 us / 1 hits = AVG 2.03
            ibmvnic_tx_scrq_flush = 31141879.57 us / 7948960 hits = AVG 3.92 us
              send_subcrq_indirect = 8412506.03 us / 728781 hits = AVG 11.54
               ^ notice hits is much lower b/c send_subcrq_direct was called
                                                  ^ wasn't traceable
      
      3. driver does checksum, safely use send_subcrq_direct (THIS PATCH):
        - Packet rate: 829k txs
        - Trace data:
          ibmvnic_xmit = 56696077.63 us / 8066168 hits = AVG 7.03 us
            skb_checksum_help = 8587456.16 us / 7526072 hits = AVG 1.14 us
            ibmvnic_tx_scrq_flush = 30219545.55 us / 7782409 hits = AVG 3.88 us
              send_subcrq_indirect = 8638326.44 us / 763693 hits = AVG 11.31 us
      
      When the bitstring ever specifies that CSO does not require headers
      (dependent on VIOS vnic server changes), then this patch should be
      removed and replaced with one that investigates the bitstring before
      using send_subcrq_direct.
      Signed-off-by: default avatarNick Child <nnac123@linux.ibm.com>
      Link: https://patch.msgid.link/20240807211809.1259563-8-nnac123@linux.ibm.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e633e32b
    • Nick Child's avatar
      ibmvnic: Only record tx completed bytes once per handler · 1c33e292
      Nick Child authored
      Byte Queue Limits depends on dql_completed being called once per tx
      completion round in order to adjust its algorithm appropriately. The
      dql->limit value is an approximation of the amount of bytes that the NIC
      can consume per irq interval. If this approximation is too high then the
      NIC will become over-saturated. Too low and the NIC will starve.
      
      The dql->limit depends on dql->prev-* stats to calculate an optimal
      value. If dql_completed() is called more than once per irq handler then
      those prev-* values become unreliable (because they are not an accurate
      representation of the previous state of the NIC) resulting in a
      sub-optimal limit value.
      
      Therefore, move the call to netdev_tx_completed_queue() to the end of
      ibmvnic_complete_tx().
      
      When performing 150 sessions of TCP rr (request-response 1 byte packets)
      workloads, one could observe:
        PREVIOUSLY: - limit and inflight values hovering around 130
                    - transaction rate of around 750k pps.
      
        NOW:        - limit rises and falls in response to inflight (130-900)
                    - transaction rate of around 1M pps (33% improvement)
      Signed-off-by: default avatarNick Child <nnac123@linux.ibm.com>
      Link: https://patch.msgid.link/20240807211809.1259563-7-nnac123@linux.ibm.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      1c33e292
    • Nick Child's avatar
      ibmvnic: Introduce send sub-crq direct · 74839f7a
      Nick Child authored
      Firmware supports two hcalls to send a sub-crq request:
      H_SEND_SUB_CRQ_INDIRECT and H_SEND_SUB_CRQ. The indirect hcall allows
      for submission of batched messages while the other hcall is limited to
      only one message. This protocol is defined in PAPR section 17.2.3.3.
      
      Previously, the ibmvnic xmit function only used the indirect hcall. This
      allowed the driver to batch it's skbs. A single skb can occupy a few
      entries per hcall depending on if FW requires skb header information or
      not. The FW only needs header information if the packet is segmented.
      
      By this logic, if an skb is not GSO then it can fit in one sub-crq
      message and therefore is a candidate for H_SEND_SUB_CRQ.
      Batching skb transmission is only useful when there are more packets
      coming down the line (ie netdev_xmit_more is true).
      
      As it turns out, H_SEND_SUB_CRQ induces less latency than
      H_SEND_SUB_CRQ_INDIRECT. Therefore, use H_SEND_SUB_CRQ where
      appropriate.
      
      Small latency gains seen when doing TCP_RR_150 (request/response
      workload). Ftrace results (graph-time=1):
        Previous:
           ibmvnic_xmit = 29618270.83 us / 8860058.0 hits = AVG 3.34
           ibmvnic_tx_scrq_flush = 21972231.02 us / 6553972.0 hits = AVG 3.35
        Now:
           ibmvnic_xmit = 22153350.96 us / 8438942.0 hits = AVG 2.63
           ibmvnic_tx_scrq_flush = 15858922.4 us / 6244076.0 hits = AVG 2.54
      Signed-off-by: default avatarNick Child <nnac123@linux.ibm.com>
      Link: https://patch.msgid.link/20240807211809.1259563-6-nnac123@linux.ibm.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      74839f7a
    • Nick Child's avatar
      ibmvnic: Remove duplicate memory barriers in tx · 6e7a5758
      Nick Child authored
      send_subcrq_[in]direct() already has a dma memory barrier.
      Remove the earlier one.
      Signed-off-by: default avatarNick Child <nnac123@linux.ibm.com>
      Link: https://patch.msgid.link/20240807211809.1259563-5-nnac123@linux.ibm.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6e7a5758
    • Nick Child's avatar
      ibmvnic: Reduce memcpys in tx descriptor generation · d95f749a
      Nick Child authored
      Previously when creating the header descriptors, the driver would:
      1. allocate a temporary buffer on the stack (in build_hdr_descs_arr)
      2. memcpy the header info into the temporary buffer (in build_hdr_data)
      3. memcpy the temp buffer into a local variable (in create_hdr_descs)
      4. copy the local variable into the return buffer (in create_hdr_descs)
      
      Since, there is no opportunity for errors during this process, the temp
      buffer is not needed and work can be done on the return buffer directly.
      
      Repurpose build_hdr_data() to only calculate the header lengths. Rename
      it to get_hdr_lens().
      Edit create_hdr_descs() to read from the skb directly and copy directly
      into the returned useful buffer.
      
      The process now involves less memory and write operations while
      also being more readable.
      Signed-off-by: default avatarNick Child <nnac123@linux.ibm.com>
      Link: https://patch.msgid.link/20240807211809.1259563-4-nnac123@linux.ibm.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d95f749a
    • Nick Child's avatar
      ibmvnic: Use header len helper functions on tx · b41b45ec
      Nick Child authored
      Use the header length helper functions rather than trying to calculate
      it within the driver. There are defined functions for mac and network
      headers (skb_mac_header_len and skb_network_header_len) but no such
      function exists for the transport header length.
      
      Also, hdr_data was memset during allocation to all 0's so no need to
      memset again.
      Signed-off-by: default avatarNick Child <nnac123@linux.ibm.com>
      Link: https://patch.msgid.link/20240807211809.1259563-3-nnac123@linux.ibm.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      b41b45ec
    • Nick Child's avatar
      ibmvnic: Only replenish rx pool when resources are getting low · dda10fc8
      Nick Child authored
      Previously, the driver would replenish the rx pool if the polling
      function consumed less than the budget. The logic being that the driver
      did not exhaust its budget so that must mean that the driver is not busy
      and has cycles to spare for replenishing the pool.
      
      So pool replenishment happens on every poll which did not consume
      the budget. This can very costly during request-response tests.
      
      In fact, an extra ~100pps can be seen in TCP_RR_150 tests when we remove
      this conditional. Trace results (ftrace, graph-time=1) for the poll
      function are below:
      Previous results:
          ibmvnic_poll = 64951846.0 us / 4167628.0 hits = AVG 15.58
          replenish_rx_pool = 17602846.0 us / 4710437.0 hits = AVG 3.74
      Now:
          ibmvnic_poll = 57673941.0 us / 4791737.0 hits = AVG 12.04
          replenish_rx_pool = 3938171.6 us / 4314.0 hits = AVG 912.88
      
      While the replenish function takes longer, it is hit less frequently
      meaning the ibmvnic_poll function, on average, is faster.
      
      Furthermore, this change does not have a negative effect on
      performance bandwidth/latency measurements.
      Signed-off-by: default avatarNick Child <nnac123@linux.ibm.com>
      Link: https://patch.msgid.link/20240807211809.1259563-2-nnac123@linux.ibm.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      dda10fc8
    • Christophe Leroy's avatar
      net: fs_enet: Fix warning due to wrong type · c146f3d1
      Christophe Leroy authored
      Building fs_enet on powerpc e500 leads to following warning:
      
          CC      drivers/net/ethernet/freescale/fs_enet/mac-scc.o
        In file included from ./include/linux/build_bug.h:5,
                         from ./include/linux/container_of.h:5,
                         from ./include/linux/list.h:5,
                         from ./include/linux/module.h:12,
                         from drivers/net/ethernet/freescale/fs_enet/mac-scc.c:15:
        drivers/net/ethernet/freescale/fs_enet/mac-scc.c: In function 'allocate_bd':
        ./include/linux/err.h:28:49: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
           28 | #define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)
              |                                                 ^
        ./include/linux/compiler.h:77:45: note: in definition of macro 'unlikely'
           77 | # define unlikely(x)    __builtin_expect(!!(x), 0)
              |                                             ^
        drivers/net/ethernet/freescale/fs_enet/mac-scc.c:138:13: note: in expansion of macro 'IS_ERR_VALUE'
          138 |         if (IS_ERR_VALUE(fep->ring_mem_addr))
              |             ^~~~~~~~~~~~
      
      This is due to fep->ring_mem_addr not being a pointer but a DMA
      address which is 64 bits on that platform while pointers are
      32 bits as this is a 32 bits platform with wider physical bus.
      
      However, using fep->ring_mem_addr is just wrong because
      cpm_muram_alloc() returns an offset within the muram and not
      a physical address directly. So use fpi->dpram_offset instead.
      Signed-off-by: default avatarChristophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://patch.msgid.link/ec67ea3a3bef7e58b8dc959f7c17d405af0d27e4.1723101144.git.christophe.leroy@csgroup.euSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      c146f3d1
    • zhangxiangqian's avatar
      net: usb: cdc_ether: don't spew notifications · 2d5c9dd2
      zhangxiangqian authored
      The usbnet_link_change function is not called, if the link has not changed.
      
      ...
      [16913.807393][ 3] cdc_ether 1-2:2.0 enx00e0995fd1ac: kevent 12 may have been dropped
      [16913.822266][ 2] cdc_ether 1-2:2.0 enx00e0995fd1ac: kevent 12 may have been dropped
      [16913.826296][ 2] cdc_ether 1-2:2.0 enx00e0995fd1ac: kevent 11 may have been dropped
      ...
      
      kevent 11 is scheduled too frequently and may affect other event schedules.
      Signed-off-by: default avatarzhangxiangqian <zhangxiangqian@kylinos.cn>
      Acked-by: default avatarOliver Neukum <oneukum@suse.com>
      Link: https://patch.msgid.link/1723109985-11996-1-git-send-email-zhangxiangqian@kylinos.cnSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      2d5c9dd2
    • Mina Almasry's avatar
      ethtool: refactor checking max channels · 916b7d31
      Mina Almasry authored
      Currently ethtool_set_channel calls separate functions to check whether
      the new channel number violates rss configuration or flow steering
      configuration.
      
      Very soon we need to check whether the new channel number violates
      memory provider configuration as well.
      
      To do all 3 checks cleanly, add a wrapper around
      ethtool_get_max_rxnfc_channel() and ethtool_get_max_rxfh_channel(),
      which does both checks. We can later extend this wrapper to add the
      memory provider check in one place.
      
      Note that in the current code, we put a descriptive genl error message
      when we run into issues. To preserve the error message, we pass the
      genl_info* to the common helper. The ioctl calls can pass NULL instead.
      Suggested-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarMina Almasry <almasrymina@google.com>
      Link: https://patch.msgid.link/20240808205345.2141858-1-almasrymina@google.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      916b7d31
  2. 09 Aug, 2024 11 commits
  3. 08 Aug, 2024 19 commits