1. 02 Mar, 2022 5 commits
    • tun: support NAPI for packets received from batched XDP buffs · fb3f9037
      Harold Huang authored
      In tun, NAPI is supported, and we can also use NAPI in the path of
      batched XDP buffs to accelerate packet processing. What is more, with
      NAPI in use, GRO is also supported. iperf shows that single-stream
      throughput improves from 4.5 Gbps to 9.2 Gbps. Additionally, 9.2 Gbps
      nearly reaches the line speed of the physical NIC, and about 15% of a
      CPU core is still idle on the vhost thread.
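      A minimal sketch of the idea, assuming NAPI is enabled on the tun queue
      (function shape is illustrative, not the exact patch; context is
      drivers/net/tun.c, where struct tun_file carries the per-queue NAPI):
      
      /* Hand a decoded XDP buff's skb to NAPI/GRO instead of plain
       * netif_receive_skb(); napi_gro_receive() lets GRO merge segments.
       */
      static void tun_rx_one_sketch(struct tun_file *tfile, struct sk_buff *skb)
      {
              if (tfile->napi_enabled) {
                      local_bh_disable();
                      napi_gro_receive(&tfile->napi, skb);    /* NAPI + GRO path */
                      local_bh_enable();
              } else {
                      netif_receive_skb(skb);                 /* previous behaviour */
              }
      }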
      
      Test topology:
      [iperf server]<--->tap<--->dpdk testpmd<--->phy nic<--->[iperf client]
      
      Iperf stream:
      iperf3 -c 10.0.0.2  -i 1 -t 10
      
      Before:
      ...
      [  5]   5.00-6.00   sec   558 MBytes  4.68 Gbits/sec    0   1.50 MBytes
      [  5]   6.00-7.00   sec   556 MBytes  4.67 Gbits/sec    1   1.35 MBytes
      [  5]   7.00-8.00   sec   556 MBytes  4.67 Gbits/sec    2   1.18 MBytes
      [  5]   8.00-9.00   sec   559 MBytes  4.69 Gbits/sec    0   1.48 MBytes
      [  5]   9.00-10.00  sec   556 MBytes  4.67 Gbits/sec    1   1.33 MBytes
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID] Interval           Transfer     Bitrate         Retr
      [  5]   0.00-10.00  sec  5.39 GBytes  4.63 Gbits/sec   72          sender
      [  5]   0.00-10.04  sec  5.39 GBytes  4.61 Gbits/sec               receiver
      
      After:
      ...
      [  5]   5.00-6.00   sec  1.07 GBytes  9.19 Gbits/sec    0   1.55 MBytes
      [  5]   6.00-7.00   sec  1.08 GBytes  9.30 Gbits/sec    0   1.63 MBytes
      [  5]   7.00-8.00   sec  1.08 GBytes  9.25 Gbits/sec    0   1.72 MBytes
      [  5]   8.00-9.00   sec  1.08 GBytes  9.25 Gbits/sec   77   1.31 MBytes
      [  5]   9.00-10.00  sec  1.08 GBytes  9.24 Gbits/sec    0   1.48 MBytes
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID] Interval           Transfer     Bitrate         Retr
      [  5]   0.00-10.00  sec  10.8 GBytes  9.28 Gbits/sec  166          sender
      [  5]   0.00-10.04  sec  10.8 GBytes  9.24 Gbits/sec               receiver
      
      Reported-at: https://lore.kernel.org/all/CACGkMEvTLG0Ayg+TtbN4q4pPW-ycgCCs3sC3-TF8cuRTf7Pp1A@mail.gmail.com
      Signed-off-by: Harold Huang <baymaxhuang@gmail.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20220228033805.1579435-1-baymaxhuang@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'sfc-optimize-rxqs-count-and-affinities' · 422ce836
      Jakub Kicinski authored
      Íñigo Huguet says:
      
      ====================
      sfc: optimize RXQs count and affinities
      
      In the sfc driver, one RX queue per physical core was allocated by
      default. Later on, IRQ affinities were set, spreading the IRQs across all
      NUMA-local CPUs.
      
      However, that default configuration results in a suboptimal setup on many
      modern systems. Specifically, on systems with hyper-threading and 2 NUMA
      nodes, affinities are set so that IRQs are handled by all logical cores of
      one NUMA node. Handling IRQs on both hyper-threading siblings brings no
      benefit, and setting affinities to one queue per physical core in the
      whole system is not a good idea either, because there is a performance
      penalty for moving data across nodes (I was able to confirm this with
      some XDP tests using pktgen).
      
      These patches reduce the default number of channels to one per physical
      core in the local NUMA node, and then set IRQ affinities to CPUs in the
      local NUMA node only. This way we save hardware resources, since channels
      are a limited resource, and we also leave more room for XDP_TX channels
      without hitting the driver's limit of 32 channels per interface.
      
      Running performance tests using iperf with an SFC9140 device showed no
      performance penalty from reducing the number of channels.
      
      RX XDP tests showed that performance can drop to less than half if the
      IRQ is handled by a CPU in a different NUMA node, which doesn't happen
      with the new defaults from these patches.
      ====================
      
      Link: https://lore.kernel.org/r/20220228132254.25787-1-ihuguet@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • sfc: set affinity hints in local NUMA node only · 09a99ab1
      Íñigo Huguet authored
      Affinity hints were being set to CPUs in the local NUMA node first, and
      then to other CPUs. This was creating 2 unintended issues:
      1. Channels meant to be assigned each to a different physical core were
         assigned to hyper-threading siblings because they were in the same
         NUMA node.
         Since the previous patch in this series, this no longer happens with
         the default rss_cpus modparam because fewer channels are created.
      2. XDP channels could be assigned to CPUs in different NUMA nodes,
         decreasing performance too much (to less than half in some of my
         tests).
      
      This patch sets the affinity hints so that the channels are spread only
      across the local NUMA node's CPUs. A fallback has also been added for the
      case where no CPU in the local NUMA node is online.
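      A rough sketch of the spreading logic, assuming 'numa_node' is the NIC's
      local node (helper name and loop shape are illustrative, not the exact
      driver code; context is drivers/net/ethernet/sfc):
      
      static void set_affinity_hints_sketch(struct efx_nic *efx, int numa_node)
      {
              const struct cpumask *node_cpus = cpumask_of_node(numa_node);
              struct efx_channel *channel;
              int cpu = -1;
      
              /* fallback: no online CPU in the local node, use all online CPUs */
              if (!cpumask_intersects(node_cpus, cpu_online_mask))
                      node_cpus = cpu_online_mask;
      
              efx_for_each_channel(channel, efx) {
                      cpu = cpumask_next_and(cpu, node_cpus, cpu_online_mask);
                      if (cpu >= nr_cpu_ids)          /* wrap around */
                              cpu = cpumask_first_and(node_cpus, cpu_online_mask);
                      irq_set_affinity_hint(channel->irq, cpumask_of(cpu));
              }
      }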
      
      Example of CPUs being assigned in a non-optimal way before this and the
      previous patch (note: in this system, xdp-8 to xdp-15 are created because
      num_possible_cpus == 64, but num_present_cpus == 32, so they're never
      used):
      
      $ lscpu | grep -i numa
      NUMA node(s):                    2
      NUMA node0 CPU(s):               0-7,16-23
      NUMA node1 CPU(s):               8-15,24-31
      
      $ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
      /proc/irq/141/0000:07:00.0-0/../smp_affinity_list:0
      /proc/irq/142/0000:07:00.0-1/../smp_affinity_list:1
      /proc/irq/143/0000:07:00.0-2/../smp_affinity_list:2
      /proc/irq/144/0000:07:00.0-3/../smp_affinity_list:3
      /proc/irq/145/0000:07:00.0-4/../smp_affinity_list:4
      /proc/irq/146/0000:07:00.0-5/../smp_affinity_list:5
      /proc/irq/147/0000:07:00.0-6/../smp_affinity_list:6
      /proc/irq/148/0000:07:00.0-7/../smp_affinity_list:7
      /proc/irq/149/0000:07:00.0-8/../smp_affinity_list:16
      /proc/irq/150/0000:07:00.0-9/../smp_affinity_list:17
      /proc/irq/151/0000:07:00.0-10/../smp_affinity_list:18
      /proc/irq/152/0000:07:00.0-11/../smp_affinity_list:19
      /proc/irq/153/0000:07:00.0-12/../smp_affinity_list:20
      /proc/irq/154/0000:07:00.0-13/../smp_affinity_list:21
      /proc/irq/155/0000:07:00.0-14/../smp_affinity_list:22
      /proc/irq/156/0000:07:00.0-15/../smp_affinity_list:23
      /proc/irq/157/0000:07:00.0-xdp-0/../smp_affinity_list:8
      /proc/irq/158/0000:07:00.0-xdp-1/../smp_affinity_list:9
      /proc/irq/159/0000:07:00.0-xdp-2/../smp_affinity_list:10
      /proc/irq/160/0000:07:00.0-xdp-3/../smp_affinity_list:11
      /proc/irq/161/0000:07:00.0-xdp-4/../smp_affinity_list:12
      /proc/irq/162/0000:07:00.0-xdp-5/../smp_affinity_list:13
      /proc/irq/163/0000:07:00.0-xdp-6/../smp_affinity_list:14
      /proc/irq/164/0000:07:00.0-xdp-7/../smp_affinity_list:15
      /proc/irq/165/0000:07:00.0-xdp-8/../smp_affinity_list:24
      /proc/irq/166/0000:07:00.0-xdp-9/../smp_affinity_list:25
      /proc/irq/167/0000:07:00.0-xdp-10/../smp_affinity_list:26
      /proc/irq/168/0000:07:00.0-xdp-11/../smp_affinity_list:27
      /proc/irq/169/0000:07:00.0-xdp-12/../smp_affinity_list:28
      /proc/irq/170/0000:07:00.0-xdp-13/../smp_affinity_list:29
      /proc/irq/171/0000:07:00.0-xdp-14/../smp_affinity_list:30
      /proc/irq/172/0000:07:00.0-xdp-15/../smp_affinity_list:31
      
      CPU assignments after this and the previous patch: normal channels are
      created one per core in the local NUMA node only, and affinities are set
      only to the local NUMA node:
      
      $ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
      /proc/irq/116/0000:07:00.0-0/../smp_affinity_list:0
      /proc/irq/117/0000:07:00.0-1/../smp_affinity_list:1
      /proc/irq/118/0000:07:00.0-2/../smp_affinity_list:2
      /proc/irq/119/0000:07:00.0-3/../smp_affinity_list:3
      /proc/irq/120/0000:07:00.0-4/../smp_affinity_list:4
      /proc/irq/121/0000:07:00.0-5/../smp_affinity_list:5
      /proc/irq/122/0000:07:00.0-6/../smp_affinity_list:6
      /proc/irq/123/0000:07:00.0-7/../smp_affinity_list:7
      /proc/irq/124/0000:07:00.0-xdp-0/../smp_affinity_list:16
      /proc/irq/125/0000:07:00.0-xdp-1/../smp_affinity_list:17
      /proc/irq/126/0000:07:00.0-xdp-2/../smp_affinity_list:18
      /proc/irq/127/0000:07:00.0-xdp-3/../smp_affinity_list:19
      /proc/irq/128/0000:07:00.0-xdp-4/../smp_affinity_list:20
      /proc/irq/129/0000:07:00.0-xdp-5/../smp_affinity_list:21
      /proc/irq/130/0000:07:00.0-xdp-6/../smp_affinity_list:22
      /proc/irq/131/0000:07:00.0-xdp-7/../smp_affinity_list:23
      /proc/irq/132/0000:07:00.0-xdp-8/../smp_affinity_list:0
      /proc/irq/133/0000:07:00.0-xdp-9/../smp_affinity_list:1
      /proc/irq/134/0000:07:00.0-xdp-10/../smp_affinity_list:2
      /proc/irq/135/0000:07:00.0-xdp-11/../smp_affinity_list:3
      /proc/irq/136/0000:07:00.0-xdp-12/../smp_affinity_list:4
      /proc/irq/137/0000:07:00.0-xdp-13/../smp_affinity_list:5
      /proc/irq/138/0000:07:00.0-xdp-14/../smp_affinity_list:6
      /proc/irq/139/0000:07:00.0-xdp-15/../smp_affinity_list:7
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • sfc: default config to 1 channel/core in local NUMA node only · c265b569
      Íñigo Huguet authored
      Handling channels from CPUs in a different NUMA node can penalize
      performance, so it is better to configure only one channel per core in
      the NIC's local NUMA node, rather than one per core in the whole system.
      
      Fall back to all other online cores if there are no online CPUs in the
      local NUMA node.
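      A minimal sketch of how such a default could be computed, assuming
      'numa_node' is the NIC's local node (not the exact driver function;
      context is drivers/net/ethernet/sfc):
      
      static unsigned int default_channel_count_sketch(int numa_node)
      {
              const struct cpumask *node_cpus = cpumask_of_node(numa_node);
              cpumask_var_t seen_cores;
              unsigned int count = 0;
              int cpu;
      
              /* fallback: local node has no online CPUs, use all online cores */
              if (!cpumask_intersects(node_cpus, cpu_online_mask))
                      node_cpus = cpu_online_mask;
      
              if (!zalloc_cpumask_var(&seen_cores, GFP_KERNEL))
                      return num_online_cpus();
      
              /* count one channel per physical core: skip hyper-threading siblings */
              for_each_cpu_and(cpu, node_cpus, cpu_online_mask) {
                      if (cpumask_test_cpu(cpu, seen_cores))
                              continue;
                      count++;
                      cpumask_or(seen_cores, seen_cores,
                                 topology_sibling_cpumask(cpu));
              }
              free_cpumask_var(seen_cores);
              return count;
      }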
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: smc: fix different types in min() · ef739f1d
      Jakub Kicinski authored
      Fix build:
      
       include/linux/minmax.h:45:25: note: in expansion of macro ‘__careful_cmp’
         45 | #define min(x, y)       __careful_cmp(x, y, <)
            |                         ^~~~~~~~~~~~~
       net/smc/smc_tx.c:150:24: note: in expansion of macro ‘min’
        150 |         corking_size = min(sock_net(&smc->sk)->smc.sysctl_autocorking_size,
            |                        ^~~
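      For context, a generic illustration of the pattern behind this kind of
      build break (not the exact code in smc_tx.c): min() rejects operands of
      different types, while min_t() casts both to a named type.
      
      #include <linux/minmax.h>
      
      static size_t corking_limit_sketch(unsigned int sysctl_autocorking_size,
                                         size_t queued_bytes)
      {
              /* min(sysctl_autocorking_size, queued_bytes) would trigger the
               * __careful_cmp type check shown above; min_t() avoids it.
               */
              return min_t(size_t, sysctl_autocorking_size, queued_bytes);
      }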
      
      Fixes: 12bbb0d1 ("net/smc: add sysctl for autocorking")
      Link: https://lore.kernel.org/r/20220301222446.1271127-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 01 Mar, 2022 22 commits
    • Merge branch 'smc-datapath-opts' · 7282c126
      David S. Miller authored
      Dust Li says:
      
      ====================
      net/smc: some datapath performance optimizations
      
      This series tries to improve the performance of SMC in the datapath.
      
      - patch #1, add a sysctl interface to support tuning the behaviour of
        SMC in container environments.
      
      - patch #2/#3, add autocorking support, which is very efficient for small
        messages without trading off latency.
      
      - patch #4, send directly when setting TCP_NODELAY, without waking up the
        TX worker; this makes it consistent with clearing TCP_CORK.
      
      - patch #5, correct the setting of the RMB window update limit, so we
        don't send CDC messages to update the peer's RMB window too frequently
        in some cases.
      
      - patch #6, implement something like NAPI in SMC, decreasing the number
        of hardirqs when busy.
      
      - patch #7, move the TX work done in the BH to the user context when the
        sock lock is held by the user.
      
      With this patchset applied, we can get a good performance gain:
      - the qperf tcp_bw test has shown a great improvement. Other benchmarks
        like 'netperf TCP_STREAM' or 'sockperf throughput' show similar results.
      - In my testing environment, running qperf tcp_bw and tcp_lat, SMC behaves
        better than TCP at almost all message sizes.
      
      Here are some test results with the following testing command:
      client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
      		-t 30 -vu tcp_{bw|lat}
      server: smc_run taskset -c 1 qperf
      
      ==== Bandwidth ====
       MsgSize        Origin SMC              TCP                SMC with patches
             1         0.578 MB/s      2.392 MB/s(313.57%)      2.561 MB/s(342.83%)
             2         1.159 MB/s      4.780 MB/s(312.53%)      5.162 MB/s(345.46%)
             4         2.283 MB/s     10.266 MB/s(349.77%)     10.122 MB/s(343.46%)
             8         4.668 MB/s     19.040 MB/s(307.86%)     20.521 MB/s(339.59%)
            16         9.147 MB/s     38.904 MB/s(325.31%)     40.823 MB/s(346.29%)
            32        18.369 MB/s     79.587 MB/s(333.25%)     80.535 MB/s(338.42%)
            64        36.562 MB/s    148.668 MB/s(306.61%)    158.170 MB/s(332.60%)
           128        72.961 MB/s    274.913 MB/s(276.80%)    316.217 MB/s(333.41%)
           256       144.705 MB/s    512.059 MB/s(253.86%)    626.019 MB/s(332.62%)
           512       288.873 MB/s    884.977 MB/s(206.35%)   1221.596 MB/s(322.88%)
          1024       574.180 MB/s   1337.736 MB/s(132.98%)   2203.156 MB/s(283.70%)
          2048      1095.192 MB/s   1865.952 MB/s( 70.38%)   3036.448 MB/s(177.25%)
          4096      2066.157 MB/s   2380.337 MB/s( 15.21%)   3834.271 MB/s( 85.58%)
          8192      3717.198 MB/s   2733.073 MB/s(-26.47%)   4904.910 MB/s( 31.95%)
         16384      4742.221 MB/s   2958.693 MB/s(-37.61%)   5220.272 MB/s( 10.08%)
         32768      5349.550 MB/s   3061.285 MB/s(-42.77%)   5321.865 MB/s( -0.52%)
         65536      5162.919 MB/s   3731.408 MB/s(-27.73%)   5245.021 MB/s(  1.59%)
      ==== Latency ====
       MsgSize        Origin SMC              TCP                SMC with patches
             1        10.540 us     11.938 us( 13.26%)         10.356 us( -1.75%)
             2        10.996 us     11.992 us(  9.06%)         10.073 us( -8.39%)
             4        10.229 us     11.687 us( 14.25%)          9.996 us( -2.28%)
             8        10.203 us     11.653 us( 14.21%)         10.063 us( -1.37%)
            16        10.530 us     11.313 us(  7.44%)         10.013 us( -4.91%)
            32        10.241 us     11.586 us( 13.13%)         10.081 us( -1.56%)
            64        10.693 us     11.652 us(  8.97%)          9.986 us( -6.61%)
           128        10.597 us     11.579 us(  9.27%)         10.262 us( -3.16%)
           256        10.409 us     11.957 us( 14.87%)         10.148 us( -2.51%)
           512        11.088 us     12.505 us( 12.78%)         10.206 us( -7.95%)
          1024        11.240 us     12.255 us(  9.03%)         10.631 us( -5.42%)
          2048        11.485 us     16.970 us( 47.76%)         10.981 us( -4.39%)
          4096        12.077 us     13.948 us( 15.49%)         11.847 us( -1.90%)
          8192        13.683 us     16.693 us( 22.00%)         13.336 us( -2.54%)
         16384        16.470 us     23.615 us( 43.38%)         16.519 us(  0.30%)
         32768        22.540 us     40.966 us( 81.75%)         22.452 us( -0.39%)
         65536        34.192 us     73.003 us(113.51%)         33.916 us( -0.81%)
      
      ------------
      Test environment notes:
      1. Testing is run on 2 VMs within the same physical host
      2. The NIC is a ConnectX-4 Lx, using SR-IOV and passing through 2 VFs to
         the 2 VMs respectively.
      3. To decrease jitter, the VMs' vCPUs are bound to physical CPUs, and
         those physical CPUs are all isolated using the boot parameter
         `isolcpus=xxx`
      4. The queue count is set to 1, and the interrupt from the queue is bound
         to CPU0 in the guest
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: don't send in the BH context if sock_owned_by_user · 6b88af83
      Dust Li authored
      Sending data all the way down to the RDMA device is a time-consuming
      operation (get a new slot, maybe do an RDMA Write and send a CDC, etc.).
      Moving those operations from the BH to the user context is good for
      performance.
      
      If the sock lock is held by the user, we don't try to send data out in
      the BH context, but just mark that we should send. Since the user will
      release the sock lock soon, we can do the sending there.
      
      Add smc_release_cb(), which will be called in release_sock(), and try to
      send in the callback if needed.
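      A hedged sketch of the deferral pattern (field and helper names are
      illustrative, not necessarily the exact ones in the patch; context is
      net/smc):
      
      static void smc_release_cb_sketch(struct sock *sk)
      {
              struct smc_sock *smc = smc_sk(sk);
      
              /* the BH only set a flag while the lock was owned by the user;
               * flush the pending TX now that the lock is being released
               */
              if (smc->conn.tx_in_release_sock) {
                      smc->conn.tx_in_release_sock = false;
                      smc_tx_pending(&smc->conn);
              }
      }
      /* hooked up via sk->sk_prot->release_cb, like tcp_release_cb() */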
      
      This patch moves the sending part out of the BH if the sock lock is held
      by the user. In my testing environment, this saves about 20% of softirq
      time on the sender side in the qperf 4K tcp_bw test, with no noticeable
      throughput drop.
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: don't req_notify until all CQEs drained · a505cce6
      Dust Li authored
      When we are handling a softirq workload, an enabled hardirq may interrupt
      the current softirq routine and then try to raise the softirq again. This
      only wastes CPU cycles and won't bring any real gain.
      
      Since IB_CQ_REPORT_MISSED_EVENTS already makes sure that if
      ib_req_notify_cq() returns 0 it is safe to wait for the next event, there
      is no need to poll the CQ again in this case.
      
      This patch disables the hardirq during the processing of the softirq and
      re-arms the CQ after the softirq is done, somewhat like NAPI.
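      A minimal sketch of the poll-then-re-arm pattern described above (the
      completion handler is passed in and purely illustrative):
      
      #include <rdma/ib_verbs.h>
      
      static void drain_and_rearm_cq_sketch(struct ib_cq *cq,
                                            void (*handle)(struct ib_wc *wc))
      {
              struct ib_wc wc;
      
              do {
                      /* drain every completion that is already available */
                      while (ib_poll_cq(cq, 1, &wc) > 0)
                              handle(&wc);
              /* re-arm the CQ; a return > 0 with IB_CQ_REPORT_MISSED_EVENTS
               * means completions may have been missed, so poll again rather
               * than waiting for the next hard interrupt
               */
              } while (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
                                            IB_CQ_REPORT_MISSED_EVENTS) > 0);
      }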
      Co-developed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: correct settings of RMB window update limit · 6bf536eb
      Dust Li authored
      rmbe_update_limit is used to keep the receive window from being announced
      too frequently. RFC 7609 requests a minimal increase in the window size of
      10% of the receive buffer space. But the current implementation uses:
      
        min_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2)
      
      and SOCK_MIN_SNDBUF / 2 == 2304 bytes, which is almost always less than
      10% of the receive buffer space.
      
      This causes the receiver to always send a CDC message to update its
      consumer cursor whenever it consumes more than 2K of data. As a result,
      we may encounter something like the "TCP silly window syndrome" when
      sending 2.5~8K messages.
      
      This patch fixes this by using max(rmbe_size / 10, SOCK_MIN_SNDBUF / 2).
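      In other words, a sketch of the limit computation (the wrapping helper
      is illustrative):
      
      #include <linux/minmax.h>
      #include <net/sock.h>
      
      static int rmbe_update_limit_sketch(int rmbe_size)
      {
              /* before: capped at 2304 bytes, i.e. usually far below 10%
               *   return min_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2);
               * after: at least 10% of the receive buffer must be consumed
               * before the consumer cursor update is announced
               */
              return max_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2);
      }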
      
      With this patch and SMC autocorking enabled, qperf 2K/4K/8K
      tcp_bw test shows 45%/75%/40% increase in throughput respectively.
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: send directly on setting TCP_NODELAY · b70a5cc0
      Dust Li authored
      In commit ea785a1a ("net/smc: Send directly when TCP_CORK is cleared"),
      we stopped using delayed work to implement corking.
      
      This patch uses the same approach: it removes the delayed work when
      setting TCP_NODELAY and sends directly in setsockopt(). This also makes
      TCP_NODELAY behave the same as in TCP.
      
      Cc: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: add sysctl for autocorking · 12bbb0d1
      Dust Li authored
      This adds a new sysctl: net.smc.autocorking_size
      
      We can dynamically change the behaviour of autocorking by changing the
      value of autocorking_size. Setting it to 0 disables autocorking in SMC.
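      For reference, the general shape of such a sysctl entry (a sketch of the
      usual ctl_table pattern, not necessarily the exact table in the patch;
      the real code points .data at the per-netns field at registration time):
      
      #include <linux/sysctl.h>
      #include <net/net_namespace.h>
      
      static struct ctl_table smc_sysctl_sketch[] = {
              {
                      .procname       = "autocorking_size",
                      .data           = &init_net.smc.sysctl_autocorking_size,
                      .maxlen         = sizeof(unsigned int),
                      .mode           = 0644,
                      .proc_handler   = proc_douintvec,
              },
              { }
      };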
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: add autocorking support · dcd2cf5f
      Dust Li authored
      This patch adds autocorking support for SMC, which can improve throughput
      for small messages by 3x or more.
      
      The main idea is borrowed from TCP autocorking, with some RDMA-specific
      modifications:
      1. The first message should never be corked, to make sure we don't
         introduce extra latency.
      2. If we have posted any Tx WRs to the NIC that have not completed,
         cork the new messages until:
         a) we receive the CQE for the last Tx WR, or
         b) we have corked enough messages on the connection.
      3. Try to push the corked data out when we receive the CQE of the
         last Tx WR, to prevent the corked messages from hanging in the
         send queue.
      
      Both SMC autocorking and TCP autocorking check TX completion to decide
      whether we should cork or not. The difference is that when we get an SMC
      Tx WR completion, the data has been confirmed by the RNIC, while a TCP TX
      completion just tells us the data has been sent out by the local NIC.
      
      Add an atomic variable tx_pushing in smc_connection to make sure only one
      sender pushes at a time, letting it cork more data and save CDC slots.
      
      SMC autocorking should not bring extra latency, since the first message
      will always be sent out immediately.
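      A hedged sketch of the corking decision (parameter names are
      illustrative; the real code works on struct smc_connection state):
      
      #include <linux/types.h>
      
      static bool should_autocork_sketch(unsigned int pending_tx_wr,
                                         size_t corked_bytes,
                                         unsigned int autocorking_size)
      {
              if (!autocorking_size)          /* sysctl set to 0: disabled */
                      return false;
              if (!pending_tx_wr)             /* first message: send at once */
                      return false;
              /* a Tx WR is still in flight: keep corking until enough data
               * has accumulated; the CQE handler pushes the rest out
               */
              return corked_bytes < autocorking_size;
      }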
      
      The qperf tcp_bw test shows a more than 4x increase at small message
      sizes with a Mellanox ConnectX-4 Lx; other throughput benchmarks like
      sockperf/netperf show the same result.
      The qperf tcp_lat test shows that SMC autocorking does not increase
      ping-pong latency.
      
      Test command:
       client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
      			-t 30 -vu tcp_{bw|lat}
       server: smc_run taskset -c 1 qperf
      
      ==== Bandwidth ====
      MsgSize(Bytes)  SMC-NoCork           TCP                      SMC-AutoCorking
            1         0.578 MB/s       2.392 MB/s(313.57%)        2.647 MB/s(357.72%)
            2         1.159 MB/s       4.780 MB/s(312.53%)        5.153 MB/s(344.71%)
            4         2.283 MB/s      10.266 MB/s(349.77%)       10.363 MB/s(354.02%)
            8         4.668 MB/s      19.040 MB/s(307.86%)       21.215 MB/s(354.45%)
           16         9.147 MB/s      38.904 MB/s(325.31%)       41.740 MB/s(356.32%)
           32        18.369 MB/s      79.587 MB/s(333.25%)       82.392 MB/s(348.52%)
           64        36.562 MB/s     148.668 MB/s(306.61%)      161.564 MB/s(341.89%)
          128        72.961 MB/s     274.913 MB/s(276.80%)      325.363 MB/s(345.94%)
          256       144.705 MB/s     512.059 MB/s(253.86%)      633.743 MB/s(337.96%)
          512       288.873 MB/s     884.977 MB/s(206.35%)     1250.681 MB/s(332.95%)
         1024       574.180 MB/s    1337.736 MB/s(132.98%)     2246.121 MB/s(291.19%)
         2048      1095.192 MB/s    1865.952 MB/s( 70.38%)     2057.767 MB/s( 87.89%)
         4096      2066.157 MB/s    2380.337 MB/s( 15.21%)     2173.983 MB/s(  5.22%)
         8192      3717.198 MB/s    2733.073 MB/s(-26.47%)     3491.223 MB/s( -6.08%)
        16384      4742.221 MB/s    2958.693 MB/s(-37.61%)     4637.692 MB/s( -2.20%)
        32768      5349.550 MB/s    3061.285 MB/s(-42.77%)     5385.796 MB/s(  0.68%)
        65536      5162.919 MB/s    3731.408 MB/s(-27.73%)     5223.890 MB/s(  1.18%)
      ==== Latency ====
      MsgSize(Bytes)   SMC-NoCork         TCP                    SMC-AutoCorking
            1          10.540 us      11.938 us( 13.26%)       10.573 us(  0.31%)
            2          10.996 us      11.992 us(  9.06%)       10.269 us( -6.61%)
            4          10.229 us      11.687 us( 14.25%)       10.240 us(  0.11%)
            8          10.203 us      11.653 us( 14.21%)       10.402 us(  1.95%)
           16          10.530 us      11.313 us(  7.44%)       10.599 us(  0.66%)
           32          10.241 us      11.586 us( 13.13%)       10.223 us( -0.18%)
           64          10.693 us      11.652 us(  8.97%)       10.251 us( -4.13%)
          128          10.597 us      11.579 us(  9.27%)       10.494 us( -0.97%)
          256          10.409 us      11.957 us( 14.87%)       10.710 us(  2.89%)
          512          11.088 us      12.505 us( 12.78%)       10.547 us( -4.88%)
         1024          11.240 us      12.255 us(  9.03%)       10.787 us( -4.03%)
         2048          11.485 us      16.970 us( 47.76%)       11.256 us( -1.99%)
         4096          12.077 us      13.948 us( 15.49%)       12.230 us(  1.27%)
         8192          13.683 us      16.693 us( 22.00%)       13.786 us(  0.75%)
        16384          16.470 us      23.615 us( 43.38%)       16.459 us( -0.07%)
        32768          22.540 us      40.966 us( 81.75%)       23.284 us(  3.30%)
        65536          34.192 us      73.003 us(113.51%)       34.233 us(  0.12%)
      
      With SMC autocorking support, we can achieve better throughput than TCP
      at most message sizes without any latency trade-off.
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: add sysctl interface for SMC · 462791bb
      Dust Li authored
      This patch adds a sysctl interface to support container environments for
      SMC, as discussed on the mailing list.
      
      Link: https://lore.kernel.org/netdev/20220224020253.GF5443@linux.alibaba.com
      Co-developed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'vxlan-vnifiltering' · 1e385c08
      David S. Miller authored
      Roopa Prabhu says:
      
      ====================
      vxlan metadata device vnifiltering support
      
      This series adds vnifiltering support to vxlan collect metadata device.
      
      Motivation:
      You can only use a single vxlan collect metadata device for a given
      vxlan udp port in the system today. The vxlan collect metadata device
      terminates all received vxlan packets. As shown in the below diagram,
      there are use-cases where you need to support multiple such vxlan devices in
      independent bridge domains. Each vxlan device must terminate the VNIs
      it is configured for.
      Example use case: in a service provider network, a service provider
      typically supports multiple bridge domains with overlapping vlans,
      one bridge domain per customer. Vlans in each bridge domain are
      mapped to globally unique vxlan ranges assigned to each customer.
      
      This series adds vnifiltering support to collect metadata devices to
      terminate only configured vnis. This is similar to vlan filtering in the
      bridge driver. The vni filtering capability is provided by a new flag on
      the collect metadata device.
      
      In the below pic:
      	- customer1 is mapped to br1 bridge domain
      	- customer2 is mapped to br2 bridge domain
      	- customer1 vlan 10-11 is mapped to vni 1001-1002
      	- customer2 vlan 10-11 is mapped to vni 2001-2002
      	- br1 and br2 are vlan filtering bridges
      	- vxlan1 and vxlan2 are collect metadata devices with
      	  vnifiltering enabled
      
      ┌──────────────────────────────────────────────────────────────────┐
      │  switch                                                          │
      │                                                                  │
      │         ┌───────────┐                 ┌───────────┐              │
      │         │           │                 │           │              │
      │         │   br1     │                 │   br2     │              │
      │         └┬─────────┬┘                 └──┬───────┬┘              │
      │     vlans│         │               vlans │       │               │
      │     10,11│         │                10,11│       │               │
      │          │     vlanvnimap:               │    vlanvnimap:        │
      │          │       10-1001,11-1002         │      10-2001,11-2002  │
      │          │         │                     │       │               │
      │   ┌──────┴┐     ┌──┴─────────┐       ┌───┴────┐  │               │
      │   │ swp1  │     │vxlan1      │       │ swp2   │ ┌┴─────────────┐ │
      │   │       │     │  vnifilter:│       │        │ │vxlan2        │ │
      │   └───┬───┘     │   1001,1002│       └───┬────┘ │ vnifilter:   │ │
      │       │         └────────────┘           │      │  2001,2002   │ │
      │       │                                  │      └──────────────┘ │
      │       │                                  │                       │
      └───────┼──────────────────────────────────┼───────────────────────┘
              │                                  │
              │                                  │
        ┌─────┴───────┐                          │
        │  customer1  │                    ┌─────┴──────┐
        │ host/VM     │                    │customer2   │
        └─────────────┘                    │ host/VM    │
                                           └────────────┘
      
      v2:
        - remove stale xstats declarations pointed out by Nikolay Aleksandrov
        - squash selinux patch with the tunnel api patch as pointed out by
          benjamin poirier
        - Fix various build issues:
      Reported-by: kernel test robot <lkp@intel.com>
      
      v3:
        - incorporate review feedback from Jakub
      	- move rhashtable declarations to c file
      	- define and use netlink policy for top level vxlan filter api
      	- fix unused stats function warning
      	- pass vninode from vnifilter lookup into stats count function
      		to avoid another lookup (only applicable to vxlan_rcv)
      	- fix missing vxlan vni delete notifications in vnifilter uninit
      	  function
      	- misc cleanups
        - remote dev check for multicast groups added via vnifiltering api
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers: vxlan: vnifilter: add support for stats dumping · 445b2f36
      Nikolay Aleksandrov authored
      Add support for VXLAN vni filter entries' stats dumping
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers: vxlan: vnifilter: per vni stats · 4095e0e1
      Nikolay Aleksandrov authored
      Add per-vni statistics for vni filter mode, counting Rx/Tx
      bytes/packets/drops/errors at the appropriate places.
      
      This patch changes vxlan_vs_find_vni to also return the
      vxlan_vni_node in cases where the vni belongs to a vni-filtering
      vxlan device.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: add new tests for vxlan vnifiltering · 3edf5f66
      Roopa Prabhu authored
      This patch adds a new test script, test_vxlan_vnifiltering.sh, with tests
      for the vni filtering api and various datapath tests. It also has a test
      with a mix of traditional, metadata and vni filtering devices in use at
      the same time.
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: vni filtering support on collect metadata device · f9c4bb0b
      Roopa Prabhu authored
      This patch adds vnifiltering support to collect metadata device.
      
      Motivation:
      You can only use a single vxlan collect metadata device for a given
      vxlan udp port in the system today. The vxlan collect metadata device
      terminates all received vxlan packets. As shown in the below diagram,
      there are use-cases where you need to support multiple such vxlan devices in
      independent bridge domains. Each vxlan device must terminate the VNIs
      it is configured for.
      Example use case: in a service provider network, a service provider
      typically supports multiple bridge domains with overlapping vlans,
      one bridge domain per customer. Vlans in each bridge domain are
      mapped to globally unique vxlan ranges assigned to each customer.
      
      vnifiltering support in collect metadata devices terminates only
      configured vnis. This is similar to vlan filtering in the bridge driver.
      The vni filtering capability is provided by a new flag on the collect
      metadata device.
      
      In the below pic:
      	- customer1 is mapped to br1 bridge domain
      	- customer2 is mapped to br2 bridge domain
      	- customer1 vlan 10-11 is mapped to vni 1001-1002
      	- customer2 vlan 10-11 is mapped to vni 2001-2002
      	- br1 and br2 are vlan filtering bridges
      	- vxlan1 and vxlan2 are collect metadata devices with
      	  vnifiltering enabled
      
      ┌──────────────────────────────────────────────────────────────────┐
      │  switch                                                          │
      │                                                                  │
      │         ┌───────────┐                 ┌───────────┐              │
      │         │           │                 │           │              │
      │         │   br1     │                 │   br2     │              │
      │         └┬─────────┬┘                 └──┬───────┬┘              │
      │     vlans│         │               vlans │       │               │
      │     10,11│         │                10,11│       │               │
      │          │     vlanvnimap:               │    vlanvnimap:        │
      │          │       10-1001,11-1002         │      10-2001,11-2002  │
      │          │         │                     │       │               │
      │   ┌──────┴┐     ┌──┴─────────┐       ┌───┴────┐  │               │
      │   │ swp1  │     │vxlan1      │       │ swp2   │ ┌┴─────────────┐ │
      │   │       │     │  vnifilter:│       │        │ │vxlan2        │ │
      │   └───┬───┘     │   1001,1002│       └───┬────┘ │ vnifilter:   │ │
      │       │         └────────────┘           │      │  2001,2002   │ │
      │       │                                  │      └──────────────┘ │
      │       │                                  │                       │
      └───────┼──────────────────────────────────┼───────────────────────┘
              │                                  │
              │                                  │
        ┌─────┴───────┐                          │
        │  customer1  │                    ┌─────┴──────┐
        │ host/VM     │                    │customer2   │
        └─────────────┘                    │ host/VM    │
                                           └────────────┘
      
      With this implementation, a vxlan dst metadata device can be associated
      with a range of vnis.
      struct vxlan_vni_node is introduced to represent a configured vni. We
      start with the vni and its associated remote_ip in this structure. The
      structure can be extended to bring in other per-vni attributes if there
      are use cases for it. A vni inherits an attribute from the base vxlan
      device if no per-vni attribute is defined.
      
      struct vxlan_dev gets a new rhashtable for vnis called vxlan_vni_group.
      vxlan_vnifilter.c implements the necessary netlink api, notifications
      and helper functions to process and manage the lifecycle of
      vxlan_vni_node.
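      A hedged sketch of the bookkeeping described above (field layout is
      illustrative, not the exact upstream structs):
      
      #include <linux/rhashtable.h>
      #include <net/vxlan.h>
      
      struct vxlan_vni_node_sketch {
              struct rhash_head vnode;        /* linkage in the per-device table */
              __be32 vni;                     /* the configured VNI */
              union vxlan_addr remote_ip;     /* optional per-vni remote/group IP */
              struct list_head vlist;         /* ordered walk for dumps */
      };
      
      struct vxlan_vni_group_sketch {
              struct rhashtable vni_hash;     /* keyed by vni */
              struct list_head vni_list;
              u32 num_vnis;
      };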
      
      This patch also adds new helper functions in vxlan_multicast.c to handle
      per-vni remote_ip multicast groups, which are part of vxlan_vni_group.
      
      Fix build problems:
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan_multicast: Move multicast helpers to a separate file · a498c595
      Roopa Prabhu authored
      Subsequent patches will add more helpers.
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnetlink: add new rtm tunnel api for tunnel id filtering · 7b8135f4
      Roopa Prabhu authored
      This patch adds a new rtm tunnel msg and api for tunnel id filtering in
      dst_metadata devices. The first dst_metadata device to use the api is the
      vxlan driver, with the AF_BRIDGE family.
      
      This and later changes add the ability in the vxlan driver to do tunnel
      id filtering (or vni filtering) on dst_metadata devices. This is similar
      to the vlan api in the vlan filtering bridge.
      
      This patch includes selinux nlmsg_route_perms support for the RTM_*TUNNEL
      api from Benjamin Poirier.
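      A hypothetical sketch of the message shape: a small fixed header naming
      the family and the dst_metadata device, followed by nested attributes
      carrying the tunnel ids (VNIs). Field names here are assumptions for
      illustration; the authoritative layout is in the uapi headers added by
      this patch.
      
      #include <linux/types.h>
      
      struct tunnel_msg_sketch {
              __u8  family;           /* AF_BRIDGE for the vxlan use case */
              __u8  flags;
              __u16 reserved;
              __u32 ifindex;          /* dst_metadata device to operate on */
      };
      /* used with RTM_NEWTUNNEL / RTM_DELTUNNEL / RTM_GETTUNNEL requests */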
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan_core: add helper vxlan_vni_in_use · efe0f94b
      Roopa Prabhu authored
      More users will be added in follow-up patches.
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan_core: make multicast helper take rip and ifindex explicitly · a9508d12
      Roopa Prabhu authored
      This patch changes the multicast helpers to take the remote IP (rip) and
      ifindex as input. This is needed in future patches, where the rip can come
      from a per-vni structure while the ifindex can come from the vxlan device.
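      Roughly, the helper signatures change along these lines (illustrative
      prototypes for the IGMP join/leave helpers in vxlan_core.c):
      
      #include <net/vxlan.h>
      
      /* before (remote IP and ifindex taken from the device configuration):
       *   static int vxlan_igmp_join(struct vxlan_dev *vxlan);
       * after (caller passes them in, so a per-vni remote IP can be used):
       */
      static int vxlan_igmp_join(struct vxlan_dev *vxlan,
                                 union vxlan_addr *rip, int rifindex);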
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan_core: move some fdb helpers to non-static · c63053e0
      Roopa Prabhu authored
      This patch makes some fdb helpers non-static for use in later patches.
      Ideally, all fdb code could move into its own file, vxlan_fdb.c. This can
      be done as a subsequent patch and is out of scope for this series.
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan_core: move common declarations to private header file · 76fc217d
      Roopa Prabhu authored
      This patch moves common structures and global declarations to a shared
      private header file, vxlan_private.h. Subsequent patches use this header
      file as a common header for additional shared declarations.
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan_core: fix build warnings in vxlan_xmit_one · fba55a66
      Roopa Prabhu authored
      Fix the below build warnings reported by kernel test robot:
         - initialize vni in vxlan_xmit_one
         - wrap label in ipv6 enabled checks in vxlan_xmit_one
      
      warnings:
      static
         drivers/net/vxlan/vxlan_core.c:2437:14: warning: variable 'label' set
      but not used [-Wunused-but-set-variable]
                 __be32 vni, label;
                             ^
      
      >> drivers/net/vxlan/vxlan_core.c:2483:7: warning: variable 'vni' is
      used uninitialized whenever 'if' condition is true
      [-Wsometimes-uninitialized]
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: move to its own directory · 67653936
      Roopa Prabhu authored
      vxlan.c has grown too long. This patch moves it to its own directory;
      subsequent patches add new functionality in new files.
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux · f2b77012
      Jakub Kicinski authored
      Saeed Mahameed says:
      
      ====================
      mlx5-next 2022-22-02
      
      The following PR includes updates to mlx5-next branch:
      
      Headlines:
      ==========
      
      1) Jakub cleans up unused static inline functions
      
      2) I did some low-level firmware command interface return status changes
      to provide the caller with full visibility into the error/status returned
      by the firmware.
      
      3) Use the new command interface in RDMA DEVX use cases to avoid flooding
      dmesg with some "expected" user-error-prone use cases.
      
      4) Moshe also uses the new command interface to grab the specific error
      code from the MFRL register command, to provide the exact reason why the
      SW reset couldn't be performed internally in FW.
      
      5) From Mark Bloch: Lag, drop packets in hardware when possible
      
      In active-backup mode the inactive interface's packets are dropped by the
      bond device. In switchdev, where TC rules are offloaded to the FDB, this
      can lead to packets hitting rules in the FDB that, without offload, would
      have been dropped before reaching the TC rules in the kernel.
      
      Create a drop rule to make sure packets on inactive ports are dropped
      before reaching the FDB.
      
      Listen for NETDEV_CHANGEUPPER / NETDEV_CHANGEINFODATA events, record the
      inactive state and offload accordingly.
      
      * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
        net/mlx5: Add clarification on sync reset failure
        net/mlx5: Add reset_state field to MFRL register
        RDMA/mlx5: Use new command interface API
        net/mlx5: cmdif, Refactor error handling and reporting of async commands
        net/mlx5: Use mlx5_cmd_do() in core create_{cq,dct}
        net/mlx5: cmdif, Add new api for command execution
        net/mlx5: cmdif, cmd_check refactoring
        net/mlx5: cmdif, Return value improvements
        net/mlx5: Lag, offload active-backup drops to hardware
        net/mlx5: Lag, record inactive state of bond device
        net/mlx5: Lag, don't use magic numbers for ports
        net/mlx5: Lag, use local variable already defined to access E-Switch
        net/mlx5: E-switch, add drop rule support to ingress ACL
        net/mlx5: E-switch, remove special uplink ingress ACL handling
        net/mlx5: E-Switch, reserve and use same uplink metadata across ports
        net/mlx5: Add ability to insert to specific flow group
        mlx5: remove unused static inlines
      ====================
      
      Link: https://lore.kernel.org/r/20220223233930.319301-1-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 28 Feb, 2022 13 commits