1. 06 Dec, 2022 6 commits
    • net/mlx5e: Remove extra layers of defines · e3840530
      Leon Romanovsky authored
      Instead of redefining XFRM core defines to the same values with an
      MLX5_* prefix, cache the input values as-is by making sure that the
      proper storage objects are used.
      Reviewed-by: Raed Salem <raeds@nvidia.com>
      Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      e3840530
    • net/mlx5e: Store replay window in XFRM attributes · cded6d80
      Leon Romanovsky authored
      As a preparation for future extension of IPsec hardware object to allow
      configuration of packet offload mode, extend the XFRM validator to check
      replay window values.
      Reviewed-by: Raed Salem <raeds@nvidia.com>
      Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      cded6d80
    • net/mlx5e: Advertise IPsec packet offload support · 59592cfd
      Leon Romanovsky authored
      Add needed capabilities check to determine if device supports IPsec
      packet offload mode.
      Reviewed-by: Raed Salem <raeds@nvidia.com>
      Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      59592cfd
    • net/mlx5: Add HW definitions for IPsec packet offload · 3afee4ed
      Leon Romanovsky authored
      Add all needed bits to support IPsec packet offload mode.
      Reviewed-by: Raed Salem <raeds@nvidia.com>
      Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      3afee4ed
    • net/mlx5: Return ready to use ASO WQE · e77bbde7
      Leon Romanovsky authored
      There is no need to hide the returned ASO WQE type behind void*;
      use the real type instead. Do it together with zeroing that memory,
      so the ASO WQE will be ready to use immediately.
      Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      e77bbde7
    • Merge branch 'Extend XFRM core to allow packet offload configuration' · 89ae6573
      Steffen Klassert authored
      Leon Romanovsky says:
      
      ============
      The following series extends XFRM core code to handle a new type of IPsec
      offload - packet offload.
      
      In this mode, the HW is going to be responsible for the whole data path,
      so both policy and state should be offloaded.
      
      IPsec packet offload is an improved version of IPsec crypto mode. In
      packet mode, HW is responsible for trimming/adding headers in addition
      to decrypt/encrypt. In this mode, the packet arrives to the stack
      already decrypted, and vice versa for TX (it exits to HW not yet
      encrypted).
      
      Devices that implement IPsec packet offload mode offload policies too.
      In the RX path, this means that HW can't effectively handle mixed SW
      and HW priorities unless users make sure that HW-offloaded policies
      have higher priorities.
      
      It means that we don't need to perform any search of inexact policies
      and/or priority checks if a HW policy was discovered. In such a
      situation, the HW will catch the packets anyway, and HW can still
      implement inexact lookups.
      
      In case specific policy is not found, we will continue with packet lookup
      and check for existence of HW policies in inexact list.
      
      HW policies are added to the head of SPD to ensure fast lookup, as XFRM
      iterates over all policies in the loop.
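The head-of-SPD ordering can be modeled with a plain singly linked list (an illustrative sketch only, not the XFRM code; `struct policy`, `spd_add()` and `spd_first()` are made-up names):

```c
#include <stdbool.h>
#include <stddef.h>

/* Model of the SPD ordering described above: HW-offloaded policies are
 * inserted at the head, SW policies appended at the tail, so a linear
 * walk over the single database always sees HW entries first. */
struct policy {
    const char *name;
    bool hw;                /* hardware-offloaded policy */
    struct policy *next;
};

static struct policy *spd;  /* head of the single policy database */

static void spd_add(struct policy *p)
{
    if (p->hw || !spd) {    /* HW policy (or empty list): insert at head */
        p->next = spd;
        spd = p;
    } else {                /* SW policy: append at the tail */
        struct policy *it = spd;

        while (it->next)
            it = it->next;
        p->next = NULL;
        it->next = p;
    }
}

static struct policy *spd_first(void)
{
    return spd;             /* lookup iterates from here; first match wins */
}
```

With `sw1`, then `hw1`, then `sw2` added, the walk order is `hw1, sw1, sw2`: both kinds of policies live in one list, yet HW entries are always checked first, without managing two databases.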
      
      This simple solution allows us to achieve the same benefits of
      separate HW/SW policy databases without over-engineering the code to
      iterate and manage two databases in the same path.
      
      To not over-engineer the code, HW policies are treated as SW ones and
      don't take netdev into account, to allow reuse of the same priorities
      for different devices. Current limitations:
       * No software fallback
       * Fragments are dropped, both in RX and TX
       * No sockets policies
       * Only IPsec transport mode is implemented
      
      ================================================================================
      Rekeying:
      
      In order to support rekeying, as the XFRM core is skipped, the
      HW/driver should do the following:
       * Count the handled packets
       * Raise an event when limits are reached
       * Drop packets once the hard limit is reached.
      
      The XFRM core calls the newly introduced xfrm_dev_state_update_curlft()
      function in order to sync device statistics with its internal
      structures. On a HW limit event, the driver calls
      xfrm_state_check_expire() to let the XFRM core take the relevant
      decisions.
      
      This separation between control logic (in XFRM) and the data plane
      allows packet offload to reuse the SW stack.
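The driver-side flow above (count packets, raise a soft-limit event, drop at the hard limit) can be sketched in plain C. This is a model only; `struct sa_limits` and `sa_account_packet()` are hypothetical names, not the mlx5 or XFRM API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-SA accounting mirroring the rekeying flow: count
 * handled packets, signal once when the soft limit is crossed (so the
 * stack can rekey), and drop traffic once the hard limit is hit. */
struct sa_limits {
    uint64_t handled;
    uint64_t soft_limit;     /* raise a rekey event here */
    uint64_t hard_limit;     /* drop packets from here on */
    bool soft_event_sent;
};

enum verdict { SA_OK, SA_REKEY_EVENT, SA_DROP };

static enum verdict sa_account_packet(struct sa_limits *sa)
{
    if (sa->handled >= sa->hard_limit)
        return SA_DROP;      /* hard limit: packet is dropped */
    sa->handled++;
    if (!sa->soft_event_sent && sa->handled >= sa->soft_limit) {
        sa->soft_event_sent = true;
        /* here a driver would call xfrm_state_check_expire() */
        return SA_REKEY_EVENT;
    }
    return SA_OK;
}
```

The soft-limit event fires exactly once, which matches the intent that the stack gets a single chance to rekey before the hard limit cuts traffic.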
      
      ================================================================================
      Configuration:
      
      iproute2: https://lore.kernel.org/netdev/cover.1652179360.git.leonro@nvidia.com/
      
      Packet offload mode:
        ip xfrm state offload packet dev <if-name> dir <in|out>
        ip xfrm policy .... offload packet dev <if-name>
      Crypto offload mode:
        ip xfrm state offload crypto dev <if-name> dir <in|out>
      or (backward compatibility)
        ip xfrm state offload dev <if-name> dir <in|out>
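A fuller end-to-end example of the packet offload commands might look as follows (illustrative only: addresses, SPI, reqid, key and interface name are placeholders, and the key length must match the chosen algorithm):

```sh
# Outbound SA in packet offload mode (rfc4106 expects a 20-byte key here)
ip xfrm state add src 192.0.2.1 dst 192.0.2.2 \
    proto esp spi 0x1000 reqid 1 mode transport \
    aead 'rfc4106(gcm(aes))' 0x1111111111111111111111111111111111111111 128 \
    offload packet dev eth0 dir out

# Matching outbound policy, offloaded to the same device
ip xfrm policy add src 192.0.2.1 dst 192.0.2.2 dir out \
    tmpl src 192.0.2.1 dst 192.0.2.2 proto esp reqid 1 mode transport \
    offload packet dev eth0
```

As noted above, the offloaded SA and policy must point at the same device; mirrored `dir in` state/policy entries are needed on the peer.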
      
      ================================================================================
      Performance results:
      
      TCP multi-stream, using iperf3 instance per-CPU.
      +----------------------+--------+--------+--------+--------+---------+---------+
      |                      | 1 CPU  | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs |
      |                      +--------+--------+--------+--------+---------+---------+
      |                      |                       BW (Gbps)                       |
      +----------------------+--------+--------+--------+--------+---------+---------+
      | Baseline             | 27.9   | 59     | 93.1   | 92.8   | 93.7    | 94.4    |
      +----------------------+--------+--------+--------+--------+---------+---------+
      | Software IPsec       | 6      | 11.9   | 23.3   | 45.9   | 83.8    | 91.8    |
      +----------------------+--------+--------+--------+--------+---------+---------+
      | IPsec crypto offload | 15     | 29.7   | 58.5   | 89.6   | 90.4    | 90.8    |
      +----------------------+--------+--------+--------+--------+---------+---------+
      | IPsec packet offload | 28     | 57     | 90.7   | 91     | 91.3    | 91.9    |
      +----------------------+--------+--------+--------+--------+---------+---------+
      
      IPsec packet offload mode behaves like the baseline and reaches line
      rate with the same number of CPUs.
      
      Setups details (similar for both sides):
      * NIC: ConnectX6-DX dual port, 100 Gbps each.
        Single port used in the tests.
      * CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
      
      ================================================================================
      Series together with mlx5 part:
      https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=xfrm-next
      
      ================================================================================
      Changelog:
      
      v10:
       * Added forgotten xdo_dev_state_del. Patch #4.
       * Moved changelog in cover letter to the end.
       * Added "if (xs->xso.type != XFRM_DEV_OFFLOAD_CRYPTO) {" line to newly
         added netronome IPsec support. Patch #2.
      v9: https://lore.kernel.org/all/cover.1669547603.git.leonro@nvidia.com
       * Added acquire support
      v8: https://lore.kernel.org/all/cover.1668753030.git.leonro@nvidia.com
       * Removed not-related blank line
       * Fixed typos in documentation
      v7: https://lore.kernel.org/all/cover.1667997522.git.leonro@nvidia.com
      As was discussed in IPsec workshop:
       * Renamed "full offload" to be "packet offload".
       * Added check that offloaded SA and policy have same device while sending packet
       * Added to SAD same optimization as was done for SPD to speed-up lookups.
      v6: https://lore.kernel.org/all/cover.1666692948.git.leonro@nvidia.com
       * Fixed misplaced "!" in sixth patch.
      v5: https://lore.kernel.org/all/cover.1666525321.git.leonro@nvidia.com
       * Rebased to latest ipsec-next.
       * Replaced HW priority patch with solution which mimics separated SPDs
         for SW and HW. See more description in this cover letter.
       * Dropped RFC tag, usecase, API and implementation are clear.
      v4: https://lore.kernel.org/all/cover.1662295929.git.leonro@nvidia.com
       * Changed title from "PATCH" to "PATCH RFC" per-request.
       * Added two new patches: one to update hard/soft limits and another
         initial take on documentation.
       * Added more info about lifetime/rekeying flow to cover letter, see
         relevant section.
       * perf traces for crypto mode will come later.
      v3: https://lore.kernel.org/all/cover.1661260787.git.leonro@nvidia.com
       * I didn't hear any suggestion what term to use instead of
         "packet offload", so left it as is. It is used in commit messages
         and documentation only and easy to rename.
       * Added performance data and background info to cover letter
       * Reused xfrm_output_resume() function to support multiple XFRM transformations
       * Add PMTU check in addition to driver .xdo_dev_offload_ok validation
       * Documentation is in progress, but not part of this series yet.
      v2: https://lore.kernel.org/all/cover.1660639789.git.leonro@nvidia.com
       * Rebased to latest 6.0-rc1
       * Add an extra check in TX datapath patch to validate packets before
         forwarding to HW.
       * Added policy cleanup logic in case of netdev down event
      v1: https://lore.kernel.org/all/cover.1652851393.git.leonro@nvidia.com
       * Moved comment to be before if (...) in third patch.
      v0: https://lore.kernel.org/all/cover.1652176932.git.leonro@nvidia.com
      -----------------------------------------------------------------------
      ============
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      89ae6573
  2. 05 Dec, 2022 8 commits
  3. 03 Dec, 2022 3 commits
    • net: ethernet: mtk_wed: fix sleep while atomic in mtk_wed_wo_queue_refill · 65e6af6c
      Lorenzo Bianconi authored
      In order to fix the following sleep-while-atomic bug, always allocate
      pages with GFP_ATOMIC in mtk_wed_wo_queue_refill, since
      page_frag_alloc runs in a spin_lock critical section.
      
      [    9.049719] Hardware name: MediaTek MT7986a RFB (DT)
      [    9.054665] Call trace:
      [    9.057096]  dump_backtrace+0x0/0x154
      [    9.060751]  show_stack+0x14/0x1c
      [    9.064052]  dump_stack_lvl+0x64/0x7c
      [    9.067702]  dump_stack+0x14/0x2c
      [    9.071001]  ___might_sleep+0xec/0x120
      [    9.074736]  __might_sleep+0x4c/0x9c
      [    9.078296]  __alloc_pages+0x184/0x2e4
      [    9.082030]  page_frag_alloc_align+0x98/0x1ac
      [    9.086369]  mtk_wed_wo_queue_refill+0x134/0x234
      [    9.090974]  mtk_wed_wo_init+0x174/0x2c0
      [    9.094881]  mtk_wed_attach+0x7c8/0x7e0
      [    9.098701]  mt7915_mmio_wed_init+0x1f0/0x3a0 [mt7915e]
      [    9.103940]  mt7915_pci_probe+0xec/0x3bc [mt7915e]
      [    9.108727]  pci_device_probe+0xac/0x13c
      [    9.112638]  really_probe.part.0+0x98/0x2f4
      [    9.116807]  __driver_probe_device+0x94/0x13c
      [    9.121147]  driver_probe_device+0x40/0x114
      [    9.125314]  __driver_attach+0x7c/0x180
      [    9.129133]  bus_for_each_dev+0x5c/0x90
      [    9.132953]  driver_attach+0x20/0x2c
      [    9.136513]  bus_add_driver+0x104/0x1fc
      [    9.140333]  driver_register+0x74/0x120
      [    9.144153]  __pci_register_driver+0x40/0x50
      [    9.148407]  mt7915_init+0x5c/0x1000 [mt7915e]
      [    9.152848]  do_one_initcall+0x40/0x25c
      [    9.156669]  do_init_module+0x44/0x230
      [    9.160403]  load_module+0x1f30/0x2750
      [    9.164135]  __do_sys_init_module+0x150/0x200
      [    9.168475]  __arm64_sys_init_module+0x18/0x20
      [    9.172901]  invoke_syscall.constprop.0+0x4c/0xe0
      [    9.177589]  do_el0_svc+0x48/0xe0
      [    9.180889]  el0_svc+0x14/0x50
      [    9.183929]  el0t_64_sync_handler+0x9c/0x120
      [    9.188183]  el0t_64_sync+0x158/0x15c
      
      Fixes: 79968444 ("net: ethernet: mtk_wed: introduce wed wo support")
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Link: https://lore.kernel.org/r/67ca94bdd3d9eaeb86e52b3050fbca0bcf7bb02f.1669908312.git.lorenzo@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      65e6af6c
    • tcp: use 2-arg optimal variant of kfree_rcu() · 55fb80d5
      Eric Dumazet authored
      kfree_rcu(1-arg) should be avoided as much as possible, since it is
      only possible from sleepable contexts and incurs extra rcu barriers.
      
      I wish the 1-arg variant of kfree_rcu() would
      get a distinct name, like kfree_rcu_slow()
      to avoid it being abused.
      
      Fixes: 459837b5 ("net/tcp: Disable TCP-MD5 static key on tcp_md5sig_info destruction")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Reviewed-by: Dmitry Safonov <dima@arista.com>
      Link: https://lore.kernel.org/r/20221202052847.2623997-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      55fb80d5
    • Merge tag 'wireless-next-2022-12-02' of... · edd4e25a
      Jakub Kicinski authored
      Merge tag 'wireless-next-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
      
      Kalle Valo says:
      
      ====================
      wireless-next patches for v6.2
      
      Third set of patches for v6.2. mt76 has a new driver for mt7996 Wi-Fi 7
      devices and iwlwifi also got initial Wi-Fi 7 support. Otherwise
      smaller features and fixes.
      
      Major changes:
      
      ath10k
       - store WLAN firmware version in SMEM image table
      
      mt76
       - mt7996: new driver for MediaTek Wi-Fi 7 (802.11be) devices
       - mt7986, mt7915: enable Wireless Ethernet Dispatch (WED) offload support
       - mt7915: add ack signal support
       - mt7915: enable coredump support
       - mt7921: remain_on_channel support
       - mt7921: channel context support
      
      iwlwifi
       - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
       - 320 MHz channels support
      
      * tag 'wireless-next-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (144 commits)
        wifi: ath10k: fix QCOM_SMEM dependency
        wifi: mt76: mt7921e: add pci .shutdown() support
        wifi: mt76: mt7915: mmio: fix naming convention
        wifi: mt76: mt7996: add support to configure spatial reuse parameter set
        wifi: mt76: mt7996: enable ack signal support
        wifi: mt76: mt7996: enable use_cts_prot support
        wifi: mt76: mt7915: rely on band_idx of mt76_phy
        wifi: mt76: mt7915: enable per bandwidth power limit support
        wifi: mt76: mt7915: introduce mt7915_get_power_bound()
        mt76: mt7915: Fix PCI device refcount leak in mt7915_pci_init_hif2()
        wifi: mt76: do not send firmware FW_FEATURE_NON_DL region
        wifi: mt76: mt7921: Add missing __packed annotation of struct mt7921_clc
        wifi: mt76: fix coverity overrun-call in mt76_get_txpower()
        wifi: mt76: mt7996: add driver for MediaTek Wi-Fi 7 (802.11be) devices
        wifi: mt76: mt76x0: remove dead code in mt76x0_phy_get_target_power
        wifi: mt76: mt7915: fix band_idx usage
        wifi: mt76: mt7915: enable .sta_set_txpwr support
        wifi: mt76: mt7915: add basedband Txpower info into debugfs
        wifi: mt76: mt7915: add support to configure spatial reuse parameter set
        wifi: mt76: mt7915: add missing MODULE_PARM_DESC
        ...
      ====================
      
      Link: https://lore.kernel.org/r/20221202214254.D0D3DC433C1@smtp.kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      edd4e25a
  4. 02 Dec, 2022 23 commits
    • wifi: ath10k: fix QCOM_SMEM dependency · d0340718
      Kalle Valo authored
      Nathan noticed that when HWSPINLOCK is disabled there's a Kconfig warning:
      
        WARNING: unmet direct dependencies detected for QCOM_SMEM
          Depends on [n]: (ARCH_QCOM [=y] || COMPILE_TEST [=n]) && HWSPINLOCK [=n]
          Selected by [m]:
          - ATH10K_SNOC [=m] && NETDEVICES [=y] && WLAN [=y] && WLAN_VENDOR_ATH [=y] && ATH10K [=m] && (ARCH_QCOM [=y] || COMPILE_TEST [=n])
      
      The problem here is that QCOM_SMEM depends on HWSPINLOCK, so we cannot
      select QCOM_SMEM and instead need to use 'depends on'.
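The shape of such a fix in Kconfig looks like this (a sketch of the pattern, not the exact upstream hunk):

```kconfig
config ATH10K_SNOC
	tristate "Qualcomm ath10k SNOC support"
	depends on ATH10K
	depends on ARCH_QCOM || COMPILE_TEST
	# QCOM_SMEM itself depends on HWSPINLOCK, so it must not be
	# select-ed (select ignores dependencies); depend on it instead.
	depends on QCOM_SMEM
```

The general rule: `select` forces a symbol on without checking its dependencies, so a symbol with its own dependencies should be pulled in via `depends on`.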
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Link: https://lore.kernel.org/all/Y4YsyaIW+CPdHWv3@dev-arch.thelio-3990X/
      Fixes: 4d79f6f3 ("wifi: ath10k: Store WLAN firmware version in SMEM image table")
      Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com>
      Signed-off-by: Kalle Valo <kvalo@kernel.org>
      Link: https://lore.kernel.org/r/20221202103027.25974-1-kvalo@kernel.org
      d0340718
    • tsnep: Rework RX buffer allocation · dbadae92
      Gerhard Engleder authored
      Refill RX queue in batches of descriptors to improve performance. Refill
      is allowed to fail as long as a minimum number of descriptors is active.
      Thus, a limited number of failed RX buffer allocations is now allowed
      for normal operation. Previously every failed allocation resulted in a
      dropped frame.
      
      If the minimum number of active descriptors is reached, then RX buffers
      are still reused and frames are dropped. This ensures that the RX queue
      never runs empty and always continues to operate.
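The refill policy described above can be modeled in a few lines of C (an illustrative sketch; the names and constants are made up, not the tsnep driver code):

```c
#include <stdbool.h>

#define RING_SIZE    64
#define REFILL_BATCH 16
#define MIN_ACTIVE    8  /* made-up minimum of buffer-backed descriptors */

struct rx_queue {
    int active;          /* descriptors currently backed by a buffer */
};

/* Refill in batches; a failed allocation is tolerated because the ring
 * keeps a minimum of active descriptors.  alloc_ok stands in for the
 * (possibly failing) page allocation. */
static int rx_refill(struct rx_queue *rx, bool alloc_ok)
{
    int refilled = 0;

    while (rx->active < RING_SIZE && refilled < REFILL_BATCH) {
        if (!alloc_ok)
            break;       /* no frame is dropped just because this failed */
        rx->active++;
        refilled++;
    }
    return refilled;
}

/* On receive: normally the buffer is handed to the stack; at or below
 * the minimum it is reused and the frame dropped, so the RX queue never
 * runs empty. */
static bool rx_receive(struct rx_queue *rx)
{
    if (rx->active <= MIN_ACTIVE)
        return false;    /* reuse buffer, drop frame */
    rx->active--;        /* buffer handed to the stack */
    return true;
}
```

The key property is that allocation failure and frame delivery are decoupled: drops happen only once the ring is down to its minimum.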
      
      Prework for future XDP support.
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dbadae92
    • tsnep: Throttle interrupts · d3dfe8d6
      Gerhard Engleder authored
      Without interrupt throttling, iperf server mode generates a CPU load of
      100% (A53 1.2GHz). Also the throughput suffers with less than 900Mbit/s
      on a 1Gbit/s link. The reason is a high interrupt load with interrupts
      every ~20us.
      
      Reduce the interrupt load by throttling interrupts. The default
      interrupt delay is 64us. For iperf server mode, the CPU load is
      significantly reduced to ~20% and the throughput reaches the maximum
      of 941Mbit/s. Interrupts are generated every ~140us.
      
      RX and TX coalesce can be configured with ethtool. RX coalesce has
      priority over TX coalesce if the same interrupt is used.
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d3dfe8d6
    • tsnep: Add ethtool::get_channels support · 4f661ccf
      Gerhard Engleder authored
      Allow user space to read the number of TX and RX queues. This is
      useful for device-dependent qdisc configurations like TAPRIO with
      hardware offload. Also ethtool::get_per_queue_coalesce /
      set_per_queue_coalesce requires that interface.
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4f661ccf
    • Documentation: bonding: correct xmit hash steps · 95cce3fa
      Jonathan Toppins authored
      Correct xmit hash steps for layer3+4 as introduced by commit
      49aefd13 ("bonding: do not discard lowest hash bit for non layer3+4
      hashing").
      Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95cce3fa
    • Documentation: bonding: update miimon default to 100 · f036b97d
      Jonathan Toppins authored
      With commit c1f897ce ("bonding: set default miimon value for non-arp
      modes if not set") the miimon default was changed from zero to 100 if
      arp_interval is also zero. Document this fact in bonding.rst.
      
      Fixes: c1f897ce ("bonding: set default miimon value for non-arp modes if not set")
      Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f036b97d
    • net: thunderbolt: Use bitwise types in the struct thunderbolt_ip_frame_header · a479f926
      Andy Shevchenko authored
      The main usage of the struct thunderbolt_ip_frame_header is to handle
      the packets on the media layer. The header is bound to the protocol,
      in which the byte ordering is crucial. However, the data type
      definition doesn't reflect that, and sparse is unhappy, for example
      (17 warnings altogether):
      
        .../thunderbolt.c:718:23: warning: cast to restricted __le32
      
        .../thunderbolt.c:966:42: warning: incorrect type in assignment (different base types)
        .../thunderbolt.c:966:42:    expected unsigned int [usertype] frame_count
        .../thunderbolt.c:966:42:    got restricted __le32 [usertype]
      
      Switch to bitwise types in the struct thunderbolt_ip_frame_header to
      reduce this, though not to solve it completely (9 warnings left),
      because the same data type is used for the Rx header handled locally
      (in CPU byte order).
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a479f926
    • net: thunderbolt: Switch from __maybe_unused to pm_sleep_ptr() etc · 0bbe50f3
      Andy Shevchenko authored
      Letting the compiler remove these functions when the kernel is built
      without CONFIG_PM_SLEEP support is simpler and lighter for builds
      than the use of __maybe_unused attributes.
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0bbe50f3
    • net: devlink: convert port_list into xarray · 47b438cc
      Jiri Pirko authored
      Some devlink instances may contain thousands of ports. Storing them in
      a linked list and looking them up is not scalable. Convert the linked
      list into an xarray.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      47b438cc
    • Merge branch 'hsr' · 3f5a4aa1
      Jakub Kicinski authored
      Sebastian Andrzej Siewior says:
      
      ====================
      I started playing with HSR and ran into a problem. I tested the latest
      upstream -rc and noticed more problems. Now it appears to work.
      For testing I have a small three-node setup with iperf and ping. While
      iperf doesn't complain, ping reports missing packets and duplicates.
      ====================
      
      Link: https://lore.kernel.org/r/20221129164815.128922-1-bigeasy@linutronix.de/
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3f5a4aa1
    • selftests: Add a basic HSR test. · 7d0455e9
      Sebastian Andrzej Siewior authored
      This test adds a basic HSRv0 network with 3 nodes. In its current
      shape it sends and forwards packets and announcements, and merges
      nodes based on MAC A/B information.
      It is able to detect duplicate packets and packet loss, should any
      occur.
      
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7d0455e9
    • hsr: Use a single struct for self_node. · 20d3c1e9
      Sebastian Andrzej Siewior authored
      self_node_db is a list_head with one entry of struct hsr_node. The
      purpose is to hold the two MAC addresses of the node itself.
      It is convenient to recycle the structure. However, having a list_head
      and always fetching the first entry is not really optimal.
      
      Create a new data structure containing the two MAC addresses, named
      hsr_self_node. Access that structure like an RCU-protected pointer so
      it can be replaced on the fly without blocking the reader.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      20d3c1e9
    • hsr: Synchronize sequence number updates. · 5c7aa132
      Sebastian Andrzej Siewior authored
      hsr_register_frame_out() compares the new sequence_nr vs the old one
      recorded in hsr_node::seq_out, and if the new sequence_nr is higher,
      it will be written to hsr_node::seq_out as the new value.
      
      This operation isn't locked, so it is possible that two frames with
      the same sequence number arrive (via the two slave devices) and are
      fed to hsr_register_frame_out() at the same time. Both pass the check
      and later update the sequence counter to the same value. As a result,
      the content of the same packet is fed into the stack twice.
      
      This was noticed by running ping and observing DUP being reported from
      time to time.
      
      Instead of using the hsr_priv::seqnr_lock for the whole receive path
      (as is done for sending in the master node), add a per-node lock that
      is only used for sequence number reads, checks and updates.
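The check-and-update step the fix serializes can be sketched in userspace C, with a pthread mutex standing in for the new per-node lock (names are illustrative, not the hsr code):

```c
#include <pthread.h>
#include <stdbool.h>

/* Minimal model of a node entry: the last sequence number handed to the
 * stack, guarded by the per-node lock that the fix introduces. */
struct node {
    pthread_mutex_t seq_out_lock;
    unsigned short seq_out;
    bool seq_valid;
};

/* Atomically check and record a frame's sequence number.  Returns true
 * if the frame is new and may be passed up, false if the same (or an
 * older) sequence number was already recorded, i.e. the frame arrived
 * as a duplicate via the other slave port. */
static bool frame_out_check(struct node *n, unsigned short seq)
{
    bool accept = false;

    pthread_mutex_lock(&n->seq_out_lock);
    /* signed difference handles sequence number wraparound */
    if (!n->seq_valid || (short)(seq - n->seq_out) > 0) {
        n->seq_out = seq;
        n->seq_valid = true;
        accept = true;
    }
    pthread_mutex_unlock(&n->seq_out_lock);
    return accept;
}
```

Because comparison and update happen under one lock, two CPUs feeding the same sequence number can no longer both pass the check.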
      
      Fixes: f421436a ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5c7aa132
    • hsr: Synchronize sending frames to have always incremented outgoing seq nr. · 06afd2c3
      Sebastian Andrzej Siewior authored
      Sending frames via the hsr (master) device requires a sequence number
      which is tracked in hsr_priv::sequence_nr and protected by
      hsr_priv::seqnr_lock. Each time a new frame is sent, it will obtain a
      new id and then send it via the slave devices.
      Each time a packet is sent (via hsr_forward_do()) the sequence number
      is checked via hsr_register_frame_out() to ensure that a frame is not
      handled twice. This makes sense on the receiving side, to ensure that
      the frame is not injected into the stack twice after it has been
      received from both slave ports.
      
      There is no locking to cover the sending path which means the following
      scenario is possible:
      
        CPU0				CPU1
        hsr_dev_xmit(skb1)		hsr_dev_xmit(skb2)
         fill_frame_info()             fill_frame_info()
          hsr_fill_frame_info()         hsr_fill_frame_info()
           handle_std_frame()            handle_std_frame()
            skb1's sequence_nr = 1
                                          skb2's sequence_nr = 2
         hsr_forward_do()              hsr_forward_do()
      
                                         hsr_register_frame_out(, 2)  // okay, sent
      
          hsr_register_frame_out(, 1) // stop, lower seq duplicate
      
      Both skbs (or their struct hsr_frame_info) received a unique id.
      However, since skb2 was sent before skb1, the higher sequence number
      was recorded in hsr_register_frame_out() and the late-arriving skb1
      was dropped and never sent.
      
      This scenario has been observed in a three node HSR setup, with node1 +
      node2 having ping and iperf running in parallel. From time to time ping
      reported a missing packet. Based on tracing that missing ping packet did
      not leave the system.
      
      It might be possible (didn't check) to drop the sequence number check
      on the sending side. But if the higher sequence number leaves on the
      wire before the lower one does, and the destination receives them in
      that order, then it will drop the packet with the lower sequence
      number and never inject it into the stack.
      Therefore it seems the only way is to lock the whole path from
      obtaining the sequence number until sending via dev_queue_xmit(),
      assuming the packets leave on the wire in the same order (and don't
      get reordered by the NIC).
      
      Cover the whole path for the master interface from obtaining the ID
      until after it has been forwarded via hsr_forward_skb() to ensure the
      skbs are sent to the NIC in the order of the assigned sequence numbers.
      
      Fixes: f421436a ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      06afd2c3
    • hsr: Disable netpoll. · d5c7652e
      Sebastian Andrzej Siewior authored
      The hsr device is a software device. Its
      net_device_ops::ndo_start_xmit() routine will process the packet and
      then pass the resulting skb to dev_queue_xmit().
      During processing, hsr acquires a lock with spin_lock_bh()
      (hsr_add_node()) which would need to be promoted to the _irq() suffix
      in order to avoid a potential deadlock.
      On top of that, there are still the warnings in dev_queue_xmit() (due
      to local_bh_disable() with disabled interrupts) left to address.
      
      Instead of trying to address those (there is qdisc and…) for netpoll's
      sake, just disable netpoll on hsr.
      
      Disable netpoll on hsr and replace the _irqsave() locking with _bh().
      
      Fixes: f421436a ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d5c7652e
    • Sebastian Andrzej Siewior's avatar
      hsr: Avoid double remove of a node. · 0c74d9f7
      Sebastian Andrzej Siewior authored
      Due to the hashed-MAC optimisation one problem becomes visible:
      hsr_handle_sup_frame() walks over the list of available nodes and merges
      two node entries into one if, based on the information in the
      supervision frame, both MAC addresses belong to one node. The list walk
      happens on an RCU-protected list and the delete operation happens under
      a lock.
      
      If the supervision frame arrives on both slave interfaces at the same
      time then this delete operation can occur simultaneously on two CPUs.
      The result is that the first CPU deletes the node from the list and the
      second CPU BUGs while attempting to dereference a poisoned list entry.
      This is more likely to happen with the optimisation because a new node
      for the mac_B entry is created once a packet has been received, and
      removed (merged) once the supervision frame has been received.
      
      Avoid removing/cleaning up an hsr_node twice by adding a `removed' field
      which is set to true after the removal and checked before the removal.
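      The guard can be sketched with a minimal userspace model (field and
      function names are illustrative, not the real hsr_node layout; in the
      kernel both the check and the flag update happen under the list lock,
      here the model is single-threaded):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the `removed' guard: whoever gets here second sees the
 * flag and backs off instead of unlinking the entry a second time. */
struct fake_node {
	bool removed;
	int unlink_count;	/* how often the unlink actually ran */
};

void fake_node_del(struct fake_node *node)
{
	if (node->removed)	/* checked before the removal */
		return;
	node->removed = true;	/* set once the removal happens */
	node->unlink_count++;	/* stands in for list_del_rcu() + free */
}
```

      The second caller finds `removed' already set and never touches the
      (by then poisoned) list pointers.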
      
      Fixes: f266a683 ("net/hsr: Better frame dispatch")
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      0c74d9f7
    • Sebastian Andrzej Siewior's avatar
      hsr: Add a rcu-read lock to hsr_forward_skb(). · 5aa28201
      Sebastian Andrzej Siewior authored
      hsr_forward_skb() forwards an skb and keeps information in an on-stack
      hsr_frame_info. hsr_get_node() assigns hsr_frame_info::node_src which
      comes from an RCU list. This pointer is used later in hsr_forward_do().
      I don't see a reason why this pointer can't vanish midway since there is
      no guarantee that hsr_forward_skb() is invoked from an RCU read section.
      
      Use rcu_read_lock() to protect hsr_frame_info::node_src from its
      assignment until it is no longer used.
      
      Fixes: f266a683 ("net/hsr: Better frame dispatch")
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      5aa28201
    • Sebastian Andrzej Siewior's avatar
      Revert "net: hsr: use hlist_head instead of list_head for mac addresses" · e012764c
      Sebastian Andrzej Siewior authored
      The hlist optimisation (which not only uses hlist_head instead of
      list_head but also splits hsr_priv::node_db into an array of 256 slots)
      does not consider the "node merge":
      Upon starting the hsr network (with three nodes) a packet that is
      sent from node1 to node3 will also be sent from node1 to node2 and then
      forwarded to node3.
      As a result node3 will receive 2 packets because it is not able
      to filter out the duplicate. Each packet received will create a new
      struct hsr_node with macaddress_A set only to the MAC address it was
      received from (one of the two MAC addresses of node1).
      At some point (early in the process) two supervision frames will be
      received from node1. They will be processed by hsr_handle_sup_frame()
      and one frame will leave early ("Node has already been merged") and do
      nothing. The other frame will be merged as portB and have its MAC
      address written to macaddress_B, and the hsr_node (that was created for
      it as macaddress_A) will be removed.
      From now on HSR is able to identify a duplicate because both packets
      sent from one node will result in the same struct hsr_node because
      hsr_get_node() will find the MAC address either on macaddress_A or
      macaddress_B.
      
      Things get tricky with the optimisation: If the sender's MAC address is
      saved as macaddress_A then the lookup will work as usual. If the MAC
      address has been merged into macaddress_B of another hsr_node then the
      lookup won't work because it is likely that the data structure is in
      another bucket. This results in creating a new struct hsr_node and not
      recognising a possible duplicate.
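      The bucket mismatch can be illustrated with a toy table (the hash
      function, table size and all names here are made up, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of the flaw: a node sits only in the bucket derived from
 * macaddress_A, so a lookup keyed on its merged macaddress_B usually
 * hashes to a different bucket and misses, creating a duplicate node. */
#define NBUCKETS 256

struct toy_node {
	unsigned char mac_a[6];
	unsigned char mac_b[6];
};

static struct toy_node *buckets[NBUCKETS];

static unsigned int toy_hash(const unsigned char *mac)
{
	return mac[5] % NBUCKETS;	/* last MAC byte picks the bucket */
}

void toy_insert(struct toy_node *n)
{
	buckets[toy_hash(n->mac_a)] = n;	/* keyed on macaddress_A only */
}

struct toy_node *toy_lookup(const unsigned char *mac)
{
	struct toy_node *n = buckets[toy_hash(mac)];

	if (n && (!memcmp(n->mac_a, mac, 6) || !memcmp(n->mac_b, mac, 6)))
		return n;
	return NULL;	/* the node may exist, but in another bucket */
}
```

      A lookup by macaddress_B lands in the wrong bucket and misses even
      though the merged node exists, which is exactly the duplicate-detection
      failure described above.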
      
      A way around it would be to add another hsr_node::mac_list_B and attach
      it to the other bucket to ensure that this hsr_node will be looked up
      either via macaddress_A _or_ macaddress_B.
      
      I however prefer to revert it because it sounds like an academic problem
      rather than a real life workload, plus it adds complexity. I'm not an
      HSR expert regarding the usual size of a network, but I would guess 40
      to 60 nodes. With 10,000 nodes and assuming 60us for pass-through (from
      node to node), it would take almost 600ms for a packet to wrap around,
      which sounds like a lot.
      
      Revert the hash MAC addresses optimisation.
      
      Fixes: 4acc45db ("net: hsr: use hlist_head instead of list_head for mac addresses")
      Cc: Juhee Kang <claudiajkang@gmail.com>
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e012764c
    • Xin Long's avatar
      sctp: delete free member from struct sctp_sched_ops · 7d802c80
      Xin Long authored
      After commit 9ed7bfc7 ("sctp: fix memory leak in
      sctp_stream_outq_migrate()"), sctp_sched_set_sched() is the only
      place calling sched->free(), and it can actually be replaced by
      sched->free_sid() on each stream, and yet there's already a loop
      to traverse all streams in sctp_sched_set_sched().
      
      This patch adds a function sctp_sched_free_sched() where it calls
      sched->free_sid() for each stream to replace sched->free() calls
      in sctp_sched_set_sched() and then deletes the unused free member
      from struct sctp_sched_ops.
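      The shape of the change can be sketched as follows (simplified stand-in
      types and names, not the actual sctp structures):

```c
#include <assert.h>

/* Toy model: instead of one ops->free() for the whole stream set, the
 * loop that already walks all streams calls ops->free_sid() per stream
 * id, so the `free' member becomes unused and can be dropped. */
struct toy_sched_ops {
	void (*free_sid)(int sid);	/* per-stream teardown */
};

static int freed_sids;

static void toy_free_sid(int sid)
{
	(void)sid;
	freed_sids++;
}

void toy_sched_free_sched(struct toy_sched_ops *ops, int nr_streams)
{
	for (int sid = 0; sid < nr_streams; sid++)
		ops->free_sid(sid);
}
```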
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Link: https://lore.kernel.org/r/e10aac150aca2686cb0bd0570299ec716da5a5c0.1669849471.git.lucien.xin@gmail.com
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      7d802c80
    • Jakub Kicinski's avatar
      Merge branch 'mptcp-pm-listener-events-selftests-cleanup' · e6a34faf
      Jakub Kicinski authored
      Matthieu Baerts says:
      
      ====================
      mptcp: PM listener events + selftests cleanup
      
      Thanks to the patch 6/11, the MPTCP path manager now sends Netlink events
      when MPTCP listening sockets are created and closed. The reason why it is
      needed is explained in the linked ticket [1]:
      
        MPTCP for Linux, when not using the in-kernel PM, depends on the
        userspace PM to create extra listening sockets before announcing
        addresses and ports. Let's call these "PM listeners".
      
        With the existing MPTCP netlink events, a userspace PM can create
        PM listeners at startup time, or in response to an incoming connection.
        Creating sockets in response to connections is not optimal: ADD_ADDRs
        can't be sent until the sockets are created and listen()ed, and if all
        connections are closed then it may not be clear to the userspace
        PM daemon that PM listener sockets should be cleaned up.
      
        Hence this feature request: to add MPTCP netlink events for listening
        socket close & create, so PM listening sockets can be managed based
        on application activity.
      
        [1] https://github.com/multipath-tcp/mptcp_net-next/issues/313
      
      Selftests for these new Netlink events have been added in patches 9,11/11.
      
      The remaining patches introduce different cleanups and small improvements
      in MPTCP selftests to ease the maintenance and the addition of new tests.
      ====================
      
      Link: https://lore.kernel.org/r/20221130140637.409926-1-matthieu.baerts@tessares.net
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e6a34faf
    • Geliang Tang's avatar
      selftests: mptcp: listener test for in-kernel PM · 178d0232
      Geliang Tang authored
      This patch adds test coverage for listening sockets created by the
      in-kernel path manager in mptcp_join.sh.
      
      It adds the listener event checking in the existing "remove single
      address with port" test. The output looks like this:
      
       003 remove single address with port syn[ ok ] - synack[ ok ] - ack[ ok ]
                                           add[ ok ] - echo  [ ok ] - pt [ ok ]
                                           syn[ ok ] - synack[ ok ] - ack[ ok ]
                                           syn[ ok ] - ack   [ ok ]
                                           rm [ ok ] - rmsf  [ ok ]   invert
                                           CREATE_LISTENER 10.0.2.1:10100[ ok ]
                                           CLOSE_LISTENER 10.0.2.1:10100 [ ok ]
      Signed-off-by: default avatarGeliang Tang <geliang.tang@suse.com>
      Reviewed-by: default avatarMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      178d0232
    • Geliang Tang's avatar
      selftests: mptcp: make evts global in mptcp_join · a3735625
      Geliang Tang authored
      This patch moves evts_ns1 and evts_ns2 out of do_transfer() as two global
      variables in mptcp_join.sh. Init them in init() and remove them in
      cleanup().
      
      Add a new helper, reset_with_events(), to save the outputs of the
      'pm_nl_ctl events' command in them, and a new helper, kill_events_pids(),
      to kill the pids of the 'pm_nl_ctl events' command. Use these helpers in
      the userspace pm tests.
      Suggested-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarGeliang Tang <geliang.tang@suse.com>
      Acked-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      a3735625