1. 09 Feb, 2021 7 commits
    • netdevsim: fib: Do not warn if route was not found for several events · 484a4dfb
      Amit Cohen authored
      The next patch will add the ability to fail route offload, controlled by
      a debugfs variable called "fail_route_offload".
      
      If we vetoed the addition, we might get a delete or append notification
      for a route we do not have. Therefore, do not warn if route was not found.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      484a4dfb
    • IPv6: Extend 'fib_notify_on_flag_change' sysctl · 6fad361a
      Amit Cohen authored
      Add the value '2' to 'fib_notify_on_flag_change' to allow sending
      notifications only for failed route installation.
      
      A separate value is added for such notifications because there are fewer
      of them, so they do not impact performance, and some users will find them
      more important.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6fad361a
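The effect of the sysctl values can be modeled as a small decision function. This is only a sketch: the function name and its boolean argument are hypothetical, and the meaning of values 0 and 1 (no notifications, and notifications on every flag change) is assumed from the earlier patch set rather than stated in this commit message.

```python
def should_notify(sysctl_value: int, offload_failed_changed: bool) -> bool:
    """Decide whether a route flag change should produce RTM_NEWROUTE.

    Assumed to be called only when at least one offload-related flag
    changed on the route. `offload_failed_changed` says whether the
    "offload failed" indication is among the flags that changed.
    """
    if sysctl_value == 0:
        # Assumption: 0 disables flag-change notifications entirely.
        return False
    if sysctl_value == 1:
        # Assumption: 1 notifies on every flag change.
        return True
    if sysctl_value == 2:
        # New value: notify only for failed route installation.
        return offload_failed_changed
    raise ValueError(f"unsupported value: {sysctl_value}")
```

With value 2, a daemon is told about failures without receiving a notification for every successful offload.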
    • IPv6: Add "offload failed" indication to routes · 0c5fcf9e
      Amit Cohen authored
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel, but not
      necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead to a
      routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      To avoid such cases, a previous patch set added the ability to emit
      RTM_NEWROUTE notifications whenever the RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed; this behavior is controlled by a sysctl.
      
      With the above-mentioned behavior, it is possible to know from user space
      whether the route was offloaded, but if the offload fails there is no
      indication to user space. Following a failure, a routing daemon will wait
      indefinitely for a notification that will never come.
      
      This patch adds an "offload_failed" indication to IPv6 routes, so that
      users will have better visibility into the offload process.
      
      'struct fib6_info' is extended with a new field that indicates whether
      route offload failed. Note that the new field uses a previously unused
      bit, so there is no need to increase the struct size.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c5fcf9e
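The claim that a new one-bit field can reuse spare bits without growing the structure can be checked with a small model. The sketch below uses Python's ctypes bit fields; the two structures and their field layout are hypothetical stand-ins, not the real 'struct fib6_info'.

```python
import ctypes


class RouteFlagsBefore(ctypes.Structure):
    # Hypothetical layout: two flag bits plus six spare bits in one byte.
    _fields_ = [
        ("offload", ctypes.c_uint8, 1),
        ("trap", ctypes.c_uint8, 1),
        ("unused", ctypes.c_uint8, 6),
    ]


class RouteFlagsAfter(ctypes.Structure):
    # Same byte, with one spare bit now used for the failure indication.
    _fields_ = [
        ("offload", ctypes.c_uint8, 1),
        ("trap", ctypes.c_uint8, 1),
        ("offload_failed", ctypes.c_uint8, 1),
        ("unused", ctypes.c_uint8, 5),
    ]


size_before = ctypes.sizeof(RouteFlagsBefore)
size_after = ctypes.sizeof(RouteFlagsAfter)

flags = RouteFlagsAfter()
flags.offload_failed = 1  # setting the new bit leaves the others untouched
```

Both structures occupy the same number of bytes, which is the property the commit message relies on.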
    • IPv4: Extend 'fib_notify_on_flag_change' sysctl · 648106c3
      Amit Cohen authored
      Add the value '2' to 'fib_notify_on_flag_change' to allow sending
      notifications only for failed route installation.
      
      A separate value is added for such notifications because there are fewer
      of them, so they do not impact performance, and some users will find them
      more important.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      648106c3
    • IPv4: Add "offload failed" indication to routes · 36c5100e
      Amit Cohen authored
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel, but not
      necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead to a
      routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      To avoid such cases, a previous patch set added the ability to emit
      RTM_NEWROUTE notifications whenever the RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed; this behavior is controlled by a sysctl.
      
      With the above-mentioned behavior, it is possible to know from user space
      whether the route was offloaded, but if the offload fails there is no
      indication to user space. Following a failure, a routing daemon will wait
      indefinitely for a notification that will never come.
      
      This patch adds an "offload_failed" indication to IPv4 routes, so that
      users will have better visibility into the offload process.
      
      'struct fib_alias' and 'struct fib_rt_info' are extended with a new field
      that indicates whether route offload failed. Note that the new field uses
      a previously unused bit, so there is no need to increase the structs' size.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36c5100e
    • rtnetlink: Add RTM_F_OFFLOAD_FAILED flag · 49fc2513
      Amit Cohen authored
      The flag indicates to user space that route offload failed.
      
      A previous patch set added the ability to emit RTM_NEWROUTE notifications
      whenever the RTM_F_OFFLOAD/RTM_F_TRAP flags are changed, but if the offload
      fails there is no indication to user space.
      
      The flag will be used in subsequent patches by netdevsim and mlxsw to
      indicate to user space that route offload failed, so that users will
      have better visibility into the offload process.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49fc2513
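A user-space consumer might interpret the flags of a received route message roughly as follows. This is a sketch under assumptions: the numeric values are the ones believed to be defined in include/uapi/linux/rtnetlink.h and should be checked against the installed headers, and the function name and the "pending" label are hypothetical.

```python
# Flag values as believed to be defined in include/uapi/linux/rtnetlink.h;
# verify against your kernel headers before relying on them.
RTM_F_OFFLOAD = 0x4000
RTM_F_TRAP = 0x8000
RTM_F_OFFLOAD_FAILED = 0x20000000


def offload_state(rtm_flags: int) -> str:
    """Classify a route by the offload-related bits in its rtm_flags."""
    if rtm_flags & RTM_F_OFFLOAD_FAILED:
        return "offload_failed"
    if rtm_flags & RTM_F_OFFLOAD:
        return "offload"
    if rtm_flags & RTM_F_TRAP:
        return "trap"
    # Hypothetical label: no hardware indication received yet.
    return "pending"
```

A routing daemon could use the "offload_failed" result to stop waiting for an offload notification that would otherwise never arrive.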
    • Merge tag 'mlx5-updates-2021-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 08cbabb7
      David S. Miller authored
      mlx5-updates-2021-02-04
      
      Vlad Buslov says:
      =================
      
      Implement support for VF tunneling
      
      Abstract
      
      Currently, mlx5 only supports configuration with tunnel endpoint IP address on
      uplink representor. Remove implicit and explicit assumptions of tunnel always
      being terminated on uplink and implement necessary infrastructure for
      configuring tunnels on VF representors and updating rules on such tunnels
      according to routing changes.
      
      SW TC model
      
      From TC perspective VF tunnel configuration requires two rules in both
      directions:
      
      TX rules
      
      1. Rule that redirects packets from UL to VF rep that has the tunnel
      endpoint IP address:
      
      $ tc -s filter show dev enp8s0f0 ingress
      filter protocol ip pref 4 flower chain 0
      filter protocol ip pref 4 flower chain 0 handle 0x1
        dst_mac 16:c9:a0:2d:69:2c
        src_mac 0c:42:a1:58:ab:e4
        eth_type ipv4
        ip_flags nofrag
        in_hw in_hw_count 1
              action order 1: mirred (Egress Redirect to device enp8s0f0_0) stolen
              index 3 ref 1 bind 1 installed 377 sec used 0 sec
              Action statistics:
              Sent 114096 bytes 952 pkt (dropped 0, overlimits 0 requeues 0)
              Sent software 0 bytes 0 pkt
              Sent hardware 114096 bytes 952 pkt
              backlog 0b 0p requeues 0
              cookie 878fa48d8c423fc08c3b6ca599b50a97
              no_percpu
              used_hw_stats delayed
      
      2. Rule that decapsulates the tunneled flow and redirects to destination VF
      representor:
      
      $ tc -s filter show dev vxlan_sys_4789 ingress
      filter protocol ip pref 4 flower chain 0
      filter protocol ip pref 4 flower chain 0 handle 0x1
        dst_mac ca:2e:a7:3f:f5:0f
        src_mac 0a:40:bd:30:89:99
        eth_type ipv4
        enc_dst_ip 7.7.7.5
        enc_src_ip 7.7.7.1
        enc_key_id 98
        enc_dst_port 4789
        enc_tos 0
        ip_flags nofrag
        in_hw in_hw_count 1
              action order 1: tunnel_key  unset pipe
               index 2 ref 1 bind 1 installed 434 sec used 434 sec
              Action statistics:
              Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
              used_hw_stats delayed
      
              action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
              index 4 ref 1 bind 1 installed 434 sec used 0 sec
              Action statistics:
              Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
              Sent software 0 bytes 0 pkt
              Sent hardware 129936 bytes 1082 pkt
              backlog 0b 0p requeues 0
              cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
              no_percpu
              used_hw_stats delayed
      
      RX rules
      
      1. Rule that encapsulates the tunneled flow and redirects packets from
      source VF rep to tunnel device:
      
      $ tc -s filter show dev enp8s0f0_1 ingress
      filter protocol ip pref 4 flower chain 0
      filter protocol ip pref 4 flower chain 0 handle 0x1
        dst_mac 0a:40:bd:30:89:99
        src_mac ca:2e:a7:3f:f5:0f
        eth_type ipv4
        ip_tos 0/0x3
        ip_flags nofrag
        in_hw in_hw_count 1
              action order 1: tunnel_key  set
              src_ip 7.7.7.5
              dst_ip 7.7.7.1
              key_id 98
              dst_port 4789
              nocsum
              ttl 64 pipe
               index 1 ref 1 bind 1 installed 411 sec used 411 sec
              Action statistics:
              Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
              no_percpu
              used_hw_stats delayed
      
              action order 2: mirred (Egress Redirect to device vxlan_sys_4789) stolen
              index 1 ref 1 bind 1 installed 411 sec used 0 sec
              Action statistics:
              Sent 5615833 bytes 4028 pkt (dropped 0, overlimits 0 requeues 0)
              Sent software 0 bytes 0 pkt
              Sent hardware 5615833 bytes 4028 pkt
              backlog 0b 0p requeues 0
              cookie bb406d45d343bf7ade9690ae80c7cba4
              no_percpu
              used_hw_stats delayed
      
      2. Rule that redirects from tunnel device to UL rep:
      
      $ tc -s filter show dev vxlan_sys_4789 ingress
      filter protocol ip pref 4 flower chain 0
      filter protocol ip pref 4 flower chain 0 handle 0x1
        dst_mac ca:2e:a7:3f:f5:0f
        src_mac 0a:40:bd:30:89:99
        eth_type ipv4
        enc_dst_ip 7.7.7.5
        enc_src_ip 7.7.7.1
        enc_key_id 98
        enc_dst_port 4789
        enc_tos 0
        ip_flags nofrag
        in_hw in_hw_count 1
              action order 1: tunnel_key  unset pipe
               index 2 ref 1 bind 1 installed 434 sec used 434 sec
              Action statistics:
              Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
              used_hw_stats delayed
      
              action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
              index 4 ref 1 bind 1 installed 434 sec used 0 sec
              Action statistics:
              Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
              Sent software 0 bytes 0 pkt
              Sent hardware 129936 bytes 1082 pkt
              backlog 0b 0p requeues 0
              cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
              no_percpu
              used_hw_stats delayed
      
      HW offloads model
      
      For hardware offload the goal is to match packet on both rules without exposing
      it to software on tunnel endpoint VF. In order to achieve this for tx, TC
      implementation marks encap rules with tunnel endpoint on mlx5 VF of same eswitch
      with MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE flag and adds header modification
      rule to overwrite packet source port to the value of tunnel VF. Eswitch code is
      modified to recirculate such packets after source port value is changed, which
      allows second tx rules to match.
      
      For rx path indirect table infrastructure is used to allow fully processing VF
      tunnel traffic in hardware. To implement such pipeline driver needs to program
      the hardware after matching on UL rule to overwrite source vport from UL to
      tunnel VF and recirculate the packet to the root table to allow matching on the
      rule installed on tunnel VF. For this, indirect table matches all encapsulated
      traffic by tunnel parameters and all other IP traffic is sent to tunnel VF by
      the miss rule. Such configuration will cause packet to appear on VF representor
      instead of VF itself if packet has been matched by indirect table rule based on
      tunnel parameters but missed on second rule (after recirculation). Handle such
      case by marking packets processed by indirect table with special 0xFFF value in
      reg_c1 and extending slow table with additional flow group that matches on
      reg_c0 (source port value set by indirect tables) and reg_c1 (special 0xFFF
      mark). When creating offloads fdb tables, install one rule per VF vport to match
      on recirculated miss packets and redirect them to appropriate VF vport.
      
      Routing events
      
      In order to support routing changes and migration of tunnel device between
      different endpoint VFs, implement routing infrastructure and update it with FIB
      events. Routing entry table is introduced to mlx5 TC. Every rx and tx VF tunnel
      rule is attached to a routing entry, which is shared for rules of same tunnel.
      On FIB event the work is scheduled to delete/recreate all rules of affected
      tunnel.
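The routing-entry idea described above can be pictured with a toy model: rules are grouped per tunnel, and a FIB event triggers deletion and re-creation of every rule in the affected group. This is a sketch only; the class, method names, and tunnel key are hypothetical and do not mirror the mlx5 data structures.

```python
from collections import defaultdict


class RoutingEntryTable:
    """Toy model: one shared routing entry per tunnel, holding its rules."""

    def __init__(self):
        self._rules = defaultdict(list)

    def attach(self, tunnel_key, rule):
        # Every rx and tx rule of a tunnel is attached to the same entry.
        self._rules[tunnel_key].append(rule)

    def handle_fib_event(self, tunnel_key, delete_rule, create_rule):
        # Delete, then recreate, all rules of the affected tunnel only.
        affected = list(self._rules.get(tunnel_key, []))
        for rule in affected:
            delete_rule(rule)
        for rule in affected:
            create_rule(rule)
        return len(affected)


log = []
table = RoutingEntryTable()
table.attach("vxlan-98", "tx-rule")
table.attach("vxlan-98", "rx-rule")
table.attach("vxlan-99", "other-rule")
count = table.handle_fib_event(
    "vxlan-98",
    delete_rule=lambda r: log.append(("del", r)),
    create_rule=lambda r: log.append(("add", r)),
)
```

Rules attached to other tunnels are left alone, which is the point of sharing one entry per tunnel.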
      
      Note: only vxlan tunnel type is supported by this series.
      
      =================
      08cbabb7
  2. 08 Feb, 2021 9 commits
  3. 07 Feb, 2021 1 commit
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · badc6ac3
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      100GbE Intel Wired LAN Driver Updates 2021-02-05
      
      This series contains updates to ice driver only.
      
      Jake adds reporting of timeout length during devlink flash and
      implements support to report devlink info regarding the version of
      firmware that is stored (downloaded) to the device, but is not yet active.
      ice_devlink_info_get will report "stored" versions when there is no
      pending flash update. Version info includes the UNDI Option ROM, the
      Netlist module, and the fw.bundle_id.
      
      Gustavo A. R. Silva replaces a one-element array with a flexible-array
      member.
      
      Bruce utilizes flex_array_size() helper and removes dead code on a check
      for a condition that can't occur.
      
      v2:
      * removed security revision implementation, and re-ordered patches to
      account for this removal
      * squashed patches implementing ice_read_flash_module to avoid patches
      refactoring the implementation of a previous patch in the series
      * modify ice_devlink_info_get to always report "stored" versions instead
      of only reporting them when a pending flash update is ready.
      
      * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
        ice: remove dead code
        ice: use flex_array_size where possible
        ice: Replace one-element array with flexible-array member
        ice: display stored UNDI firmware version via devlink info
        ice: display stored netlist versions via devlink info
        ice: display some stored NVM versions via devlink info
        ice: introduce function for reading from flash modules
        ice: cache NVM module bank information
        ice: introduce context struct for info report
        ice: create flash_info structure and separate NVM version
        ice: report timeout length for erasing during devlink flash
      ====================
      
      Link: https://lore.kernel.org/r/20210206044101.636242-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      badc6ac3
  4. 06 Feb, 2021 23 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · c273a20c
      Jakub Kicinski authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS updates for net-next
      
      1) Remove indirection and use nf_ct_get() instead from nfnetlink_log
         and nfnetlink_queue, from Florian Westphal.
      
      2) Add weighted random twos choice least-connection scheduling for IPVS,
         from Darby Payne.
      
      3) Add a __hash placeholder in the flow tuple structure to identify
         the field to be included in the rhashtable key hash calculation.
      
      4) Add a new nft_parse_register_load() and nft_parse_register_store()
         to consolidate register load and store in the core.
      
      5) Statify nft_parse_register() since it has no more module clients.
      
      6) Remove redundant assignment in nft_cmp, from Colin Ian King.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
        netfilter: nftables: remove redundant assignment of variable err
        netfilter: nftables: statify nft_parse_register()
        netfilter: nftables: add nft_parse_register_store() and use it
        netfilter: nftables: add nft_parse_register_load() and use it
        netfilter: flowtable: add hash offset field to tuple
        ipvs: add weighted random twos choice algorithm
        netfilter: ctnetlink: remove get_ct indirection
      ====================
      
      Link: https://lore.kernel.org/r/20210206015005.23037-1-pablo@netfilter.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c273a20c
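Item 2 of the pull request above names a "weighted random twos choice least-connection" scheduler. The general idea behind such schedulers (pick two candidates at random, weighted, then keep the less loaded one) can be sketched as below. This is an illustration of the generic technique, not the IPVS implementation; the tuple layout and function name are hypothetical.

```python
import random


def twos_choice(servers, rng=random):
    """Pick a server using a weighted "two random choices" rule.

    `servers` is a list of (name, weight, active_connections) tuples with
    at least one positive weight. Two candidates are drawn at random in
    proportion to weight (possibly the same one twice), and the one with
    the lower connections-per-weight ratio is returned.
    """
    weights = [weight for _, weight, _ in servers]
    first, second = rng.choices(servers, weights=weights, k=2)
    # Compare conns/weight without division by cross-multiplying.
    if first[2] * second[1] <= second[2] * first[1]:
        return first
    return second
```

Drawing only two candidates keeps the selection cheap while still steering new connections away from the busier of the two.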
    • r8169: don't try to disable interrupts if NAPI is scheduled already · 7274c414
      Heiner Kallweit authored
      There's no benefit in trying to disable interrupts if NAPI is
      scheduled already. This allows us to save a PCI write in this case.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Link: https://lore.kernel.org/r/78c7f2fb-9772-1015-8c1d-632cbdff253f@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7274c414
    • net/packet: Improve the comment about LL header visibility criteria · 21c85974
      Xie He authored
      The "dev_has_header" function, recently added in
      commit d5496990 ("net/packet: fix packet receive on L3 devices
      without visible hard header"),
      is more accurate as criteria for determining whether a device exposes
      the LL header to upper layers, because in addition to dev->header_ops,
      it also checks for dev->header_ops->create.
      
      When transmitting an skb on a device, dev_hard_header can be called to
      generate an LL header. dev_hard_header will only generate a header if
      dev->header_ops->create is present.
      Signed-off-by: Xie He <xie.he.0141@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/r/20210205224124.21345-1-xie.he.0141@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      21c85974
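The criterion described in this commit message, that a device exposes the LL header only when both dev->header_ops and dev->header_ops->create are present, can be modeled in a few lines. The Python classes below are hypothetical stand-ins for the kernel structures, used only to show the two-part check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HeaderOps:
    # Stand-in for header_ops; only the 'create' callback matters here.
    create: Optional[Callable[..., int]] = None


@dataclass
class NetDevice:
    # Stand-in for a network device with optional header operations.
    header_ops: Optional[HeaderOps] = None


def dev_has_header(dev: NetDevice) -> bool:
    """True only if header_ops exists and provides a create callback."""
    return dev.header_ops is not None and dev.header_ops.create is not None
```

A device that has header_ops but no create callback is treated like one with no header at all, which matches the reasoning about dev_hard_header in the message.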
    • Merge branch 'net-ipa-a-mix-of-small-improvements' · 163a1802
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: a mix of small improvements
      
      Version 2 of this series restructures a couple of the changed
      functions (in patches 1 and 2) to avoid blocks of indented code
      by returning early when possible, as suggested by Jakub.  The
      description of the first patch was changed as a result, to better
      reflect what the updated patch does.  It also fixes one spot I
      identified when updating the code, where gsi_channel_stop() was
      doing the wrong thing on error.
      
      The original description for this series is below.
      
      This series contains a sort of unrelated set of code cleanups.
      
      The first two are things I wanted to do in a series that updated
      some NAPI code recently.  I didn't want to change things in a way
      that affected existing testing so I set these aside for later
      (i.e., now).
      
      The third makes a change to event ring handling that's similar to
      what was done a while back for channels.  There's little benefit to
      caching the current state of an event ring, so with this we'll just
      fetch the state from hardware whenever we need it.
      
      The fourth patch removes the definitions of two unused symbols.
      
      The fifth replaces a count that is always 0 or 1 with a Boolean.
      
      The sixth removes a build-time validation check that doesn't really
      provide benefit.
      
      And the last one fixes a problem (in two spots) that could cause a
      build-time check to fail "bogusly".
      ====================
      
      Link: https://lore.kernel.org/r/20210205221100.1738-1-elder@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      163a1802
    • net: ipa: avoid field overflow · cd115009
      Alex Elder authored
      It's possible that the length passed to ipa_header_size_encoded()
      is larger than what can be represented by the HDR_LEN field alone
      (starting with IPA v4.5).  If we attempted that, u32_encode_bits()
      would trigger a build-time error.
      
      Avoid this problem by masking off high-order bits of the value
      encoded as the lower portion of the header length.
      
      The same sort of problem exists in ipa_metadata_offset_encoded(),
      so implement the same fix there.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cd115009
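The overflow and its fix can be illustrated with a small model of field encoding. This is a sketch under assumptions: the field positions (a 6-bit low field and a 2-bit high field at bits 29:28) and all helper names are hypothetical, and a run-time ValueError stands in for the build-time error that u32_encode_bits() would raise.

```python
def genmask(high: int, low: int) -> int:
    """Contiguous bit mask covering bits low..high inclusive."""
    return ((1 << (high - low + 1)) - 1) << low


# Hypothetical field positions, for illustration only.
HDR_LEN_FMASK = genmask(5, 0)
HDR_LEN_MSB_FMASK = genmask(29, 28)


def field_encode(value: int, mask: int) -> int:
    """Place value into the field described by mask; reject overflow."""
    shift = (mask & -mask).bit_length() - 1
    if value > (mask >> shift):
        raise ValueError("value does not fit in field")
    return value << shift


def header_size_encoded(header_size: int, has_msb_field: bool) -> int:
    """Encode a header size, masking the low part so it always fits."""
    low_width = bin(HDR_LEN_FMASK).count("1")
    value = field_encode(header_size & (HDR_LEN_FMASK >> 0), HDR_LEN_FMASK)
    if has_msb_field:
        # The remaining high-order bits go into the separate MSB field.
        value |= field_encode(header_size >> low_width, HDR_LEN_MSB_FMASK)
    return value
```

Without the mask on the low portion, a size such as 0x45 would be rejected by the low field even though the combined fields can represent it.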
    • net: ipa: get rid of status size constraint · 48735374
      Alex Elder authored
      There is a build-time check that the packet status structure is a
      multiple of 4 bytes in size.  It's not clear where that constraint
      comes from, but the structure defines what hardware provides so its
      definition won't change.  Get rid of the check; it adds no value.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      48735374
    • net: ipa: use a Boolean rather than count when replenishing · 9af5ccf3
      Alex Elder authored
      The count argument to ipa_endpoint_replenish() is only ever 0 or 1,
      and always will be (because we always handle each receive buffer in
      a single transaction).  Rename the argument to be add_one and change
      it to be Boolean.
      
      Update the function description to reflect the current code.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      9af5ccf3
    • net: ipa: remove two unused register definitions · d5bc5015
      Alex Elder authored
      We do not support inter-EE channel or event ring commands.  Inter-EE
      interrupts are disabled (and never re-enabled) for all channels and
      event rings, so we have no need for the GSI registers that clear
      those interrupt conditions.  So remove their definitions.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d5bc5015
    • net: ipa: do not cache event ring state · 3f77c926
      Alex Elder authored
      An event ring's state only needs to be known when it is allocated,
      reset, or deallocated.  We check an event ring's state both before
      and after performing an event ring control command that changes
      its state.  These are only issued at startup and shutdown, so there
      is very little value in caching the state.
      
      Stop recording a copy of the channel's last known state, and instead
      fetch the true state from hardware whenever it's needed.  In such
      cases, *do* record the state in a local variable, in case an error
      message reports it (so the value reported is the value seen).
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3f77c926
    • net: ipa: synchronize NAPI only for suspend · b1750723
      Alex Elder authored
      When stopping a channel, gsi_channel_stop() will ensure NAPI
      polling is complete when it calls napi_disable().  So there is no
      need to call napi_synchronize() in that case.
      
      Move the call to napi_synchronize() out of __gsi_channel_stop()
      and into gsi_channel_suspend(), so it's only used where needed.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b1750723
    • net: ipa: move mutex calls into __gsi_channel_stop() · 63ec9be1
      Alex Elder authored
      Move the mutex calls out of gsi_channel_stop_retry() and into
      __gsi_channel_stop(), to make the latter more semantically similar
      to __gsi_channel_start().
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      63ec9be1
    • Merge branch 'lag-offload-for-ocelot-dsa-switches' · bfc213f1
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      LAG offload for Ocelot DSA switches
      
      This patch series reworks the ocelot switchdev driver such that it could
      share the same implementation for LAG offload as the felix DSA driver.
      
      Testing has been done in the following topology:
      
               +----------------------------------+
               | Board 1         br0              |
               |             +---------+          |
               |            /           \         |
               |            |           |         |
               |            |         bond0       |
               |            |        +-----+      |
               |            |       /       \     |
               |  eno0     swp0    swp1    swp2   |
               +---|--------|-------|-------|-----+
                   |        |       |       |
                   +--------+       |       |
                     Cable          |       |
                               Cable|       |Cable
                     Cable          |       |
                   +--------+       |       |
                   |        |       |       |
               +---|--------|-------|-------|-----+
               |  eno0     swp0    swp1    swp2   |
               |            |       \       /     |
               |            |        +-----+      |
               |            |         bond0       |
               |            |           |         |
               |            \           /         |
               |             +---------+          |
               | Board 2         br0              |
               +----------------------------------+
      
      The same script can be run on both Board 1 and Board 2 to set this up:
      
      ip link del bond0
      ip link add bond0 type bond mode balance-xor miimon 1
      OR
      ip link add bond0 type bond mode 802.3ad
      ip link set swp1 down && ip link set swp1 master bond0 && ip link set swp1 up
      ip link set swp2 down && ip link set swp2 master bond0 && ip link set swp2 up
      ip link del br0
      ip link add br0 type bridge
      ip link set bond0 master br0
      ip link set swp0 master br0
      
      Then traffic can be tested between eno0 of Board 1 and eno0 of Board 2.
      ====================
      
      Link: https://lore.kernel.org/r/20210205220221.255646-1-olteanv@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bfc213f1
    • net: dsa: felix: propagate the LAG offload ops towards the ocelot lib · 8fe6832e
      Vladimir Oltean authored
      The ocelot switch has been supporting LAG offload since its initial
      commit, however felix could not make use of that, due to lack of a LAG
      abstraction in DSA. Now that we have that, let's forward DSA's calls
      towards the ocelot library, which will deal with setting up the bonding.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8fe6832e
    • net: dsa: make assisted_learning_on_cpu_port bypass offloaded LAG interfaces · a324d3d4
      Vladimir Oltean authored
      Given the following topology, and focusing only on Box A:
      
               Box A
               +----------------------------------+
               | Board 1         br0              |
               |             +---------+          |
               |            /           \         |
               |            |           |         |
               |            |         bond0       |
               |            |        +-----+      |
               |192.168.1.1 |       /       \     |
               |  eno0     swp0    swp1    swp2   |
               +---|--------|-------|-------|-----+
                   |        |       |       |
                   +--------+       |       |
                     Cable          |       |
                               Cable|       |Cable
                     Cable          |       |
                   +--------+       |       |
                   |        |       |       |
               +---|--------|-------|-------|-----+
               |  eno0     swp0    swp1    swp2   |
               |192.168.1.2 |       \       /     |
               |            |        +-----+      |
               |            |         bond0       |
               |            |           |         |
               |            \           /         |
               |             +---------+          |
               | Board 2         br0              |
               +----------------------------------+
               Box B
      
      The assisted_learning_on_cpu_port logic will see that swp0 is bridged
      with a "foreign interface" (bond0) and will therefore install all
      addresses learnt by the software bridge towards bond0 (including the
      address of eno0 on Box B) as static addresses towards the CPU port.
      
      But that's not what we want - bond0 is not really a "foreign interface"
      but one we can offload including L2 forwarding from/towards it. So we
      need to refine our logic for assisted learning such that, whenever we
      see an address learnt on a non-DSA interface, we search through the tree
      for any port that offloads that non-DSA interface.
      
      Some confusion might arise as to why we search through the whole tree
      instead of just the local switch returned by dsa_slave_dev_lower_find.
      Or a different angle of the same confusion: why does
      dsa_slave_dev_lower_find(br_dev) return a single dp that's under br_dev
      instead of the whole list of bridged DSA ports?
      
      To answer the second question, it should be enough to install the static
      FDB entry on the CPU port of a single switch in the tree, because
      dsa_port_fdb_add uses DSA_NOTIFIER_FDB_ADD which ensures that all other
      switches in the tree get notified of that address, and add the entry
      themselves using dsa_towards_port().
      
      This should help understand the answer to the first question: the port
      returned by dsa_slave_dev_lower_find may not be on the same switch as
      the ports that offload the LAG. Nonetheless, if the driver implements
      .crosschip_lag_join and .crosschip_bridge_join as mv88e6xxx does, there
      still isn't any reason for trapping addresses learnt on the remote LAG
      towards the CPU, and we should prevent that.
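
      The refined check described above can be sketched in plain C as follows.
      This is a minimal userspace model, not the kernel implementation: the
      struct layouts and the names port_offloads_netdev, tree_offloads_netdev
      and needs_cpu_fdb_entry are assumptions made for illustration. The point
      it shows is that the search covers every port of every switch in the
      tree, so a LAG offloaded on a remote switch is still recognized.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for kernel objects; all names here are hypothetical. */
struct netdev {
	const char *name;
};

struct port {
	int switch_index;         /* which switch of the tree the port sits on */
	const struct netdev *lag; /* LAG (bond) offloaded by this port, or NULL */
};

struct tree {
	const struct port *ports; /* all ports of all switches in the tree */
	size_t num_ports;
};

/* Does this single port offload forwarding for @dev? */
static bool port_offloads_netdev(const struct port *p, const struct netdev *dev)
{
	return p->lag != NULL && p->lag == dev;
}

/* Walk every port in the tree, not only those of the local switch. */
static bool tree_offloads_netdev(const struct tree *t, const struct netdev *dev)
{
	for (size_t i = 0; i < t->num_ports; i++)
		if (port_offloads_netdev(&t->ports[i], dev))
			return true;
	return false;
}

/*
 * An address learnt on @dev needs a static entry towards the CPU port
 * only when no port anywhere in the tree offloads @dev.
 */
static bool needs_cpu_fdb_entry(const struct tree *t, const struct netdev *dev)
{
	return !tree_offloads_netdev(t, dev);
}
```

      With a tree where two ports on switch 1 offload bond0, an address learnt
      on bond0 would not be trapped to the CPU, while an address learnt on an
      unrelated interface still would be.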
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a324d3d4
    • net: mscc: ocelot: rebalance LAGs on link up/down events · 23ca3b72
      Vladimir Oltean authored
      At present there is an issue when ocelot is offloading a bonding
      interface, but one of the links of the physical ports goes down. Traffic
      keeps being hashed towards that destination, and of course gets dropped
      on egress.
      
      Monitor the netdev notifier events emitted by the bonding driver for
      changes in the physical state of lower interfaces, to determine which
      ports are active and which ones are no longer.
      
      Then extend ocelot_get_bond_mask to return either the configured bonding
      interfaces, or the active ones, depending on a boolean argument. The
      code that does rebalancing only needs to do so among the active ports,
      whereas the bridge forwarding mask and the logical port IDs still need
      to look at the permanently bonded ports.
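
      A rough model of a bond-mask helper with such a boolean argument is shown
      below. It is a sketch under stated assumptions rather than the driver
      code: struct port_state, its lag_tx_active field and the get_bond_mask
      signature are hypothetical, and the mask is limited to 32 ports.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-port state; field names are assumptions for this sketch. */
struct port_state {
	const void *bond;   /* bonding upper interface, NULL if not bonded */
	bool lag_tx_active; /* true while the port's link can carry LAG traffic */
};

/*
 * Return a bitmask of the ports that belong to @bond.
 * With only_active_ports set, ports whose link is down are left out,
 * which is what a rebalancing step would want. Without it, all
 * configured members are returned, as needed for forwarding masks
 * and logical port IDs.
 */
static uint32_t get_bond_mask(const struct port_state *ports, size_t num_ports,
			      const void *bond, bool only_active_ports)
{
	uint32_t mask = 0;

	if (bond == NULL)
		return 0;

	for (size_t i = 0; i < num_ports && i < 32; i++) {
		if (ports[i].bond != bond)
			continue;
		if (only_active_ports && !ports[i].lag_tx_active)
			continue;
		mask |= UINT32_C(1) << i;
	}

	return mask;
}
```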
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      23ca3b72
    • net: mscc: ocelot: rename aggr_count to num_ports_in_lag · 21357b61
      Vladimir Oltean authored
      It makes it a bit easier to read and understand the code that deals with
      balancing the 16 aggregation codes among the ports in a certain LAG.
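
      To make the role of num_ports_in_lag concrete, here is one way the 16
      aggregation codes could be spread over the members of a LAG. The
      round-robin (modulo) assignment is an assumption chosen for illustration,
      and balance_aggr_codes and code_to_port are hypothetical names, not
      driver symbols.

```c
#include <stdint.h>

#define NUM_AGGR_CODES 16

/*
 * Map each of the 16 aggregation codes to one member port of a LAG,
 * round-robin. @lag_mask has one bit set per member port.
 * Returns the number of ports in the LAG; with an empty mask every
 * code maps to -1.
 */
static int balance_aggr_codes(uint32_t lag_mask, int code_to_port[NUM_AGGR_CODES])
{
	int members[32];
	int num_ports_in_lag = 0;

	/* Collect the member ports in ascending order. */
	for (int port = 0; port < 32; port++)
		if (lag_mask & (UINT32_C(1) << port))
			members[num_ports_in_lag++] = port;

	/* Hand out the codes to the members in turn. */
	for (int code = 0; code < NUM_AGGR_CODES; code++)
		code_to_port[code] = num_ports_in_lag ?
			members[code % num_ports_in_lag] : -1;

	return num_ports_in_lag;
}
```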
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      21357b61
    • net: mscc: ocelot: drop the use of the "lags" array · 528d3f19
      Vladimir Oltean authored
      We can now simplify the implementation by always using ocelot_get_bond_mask
      to look up the other ports that are offloading the same bonding interface
      as us.
      
      In ocelot_set_aggr_pgids, the code had a way to uniquely iterate through
      LAGs. We need to achieve the same behavior by marking each LAG as visited,
      which we do now by using a temporary 32-bit "visited" bitmask. This is
      ok and we do not need dynamic memory allocation, because we know that
      this switch architecture will not have more than 32 ports (the PGID port
      masks are 32-bit anyway).
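
      The "visited" bitmask idea can be illustrated with a small standalone
      sketch. It assumes each port carries an integer bond identifier (0 for
      "not bonded"), which is a simplification of the real per-port state;
      bond_mask_of and for_each_lag_once are hypothetical names.

```c
#include <stdint.h>

/* Bitmask of all ports whose bond identifier equals @bond. */
static uint32_t bond_mask_of(const int *bond_of, int num_ports, int bond)
{
	uint32_t mask = 0;

	for (int port = 0; port < num_ports; port++)
		if (bond_of[port] == bond)
			mask |= UINT32_C(1) << port;

	return mask;
}

/*
 * Visit every LAG exactly once without a per-LAG array.
 * bond_of[port] is the bond identifier of the port, 0 meaning none.
 * The member mask of each distinct LAG is stored in lag_masks[];
 * the return value is the number of LAGs found.
 */
static int for_each_lag_once(const int *bond_of, int num_ports,
			     uint32_t lag_masks[32])
{
	uint32_t visited = 0; /* one bit per port, so at most 32 ports */
	int num_lags = 0;

	if (num_ports > 32)
		num_ports = 32;

	for (int port = 0; port < num_ports; port++) {
		uint32_t mask;

		/* Skip unbonded ports and ports of an already handled LAG. */
		if (bond_of[port] == 0 || (visited & (UINT32_C(1) << port)))
			continue;

		mask = bond_mask_of(bond_of, num_ports, bond_of[port]);
		lag_masks[num_lags++] = mask;

		/* Mark all members so the same LAG is not processed again. */
		visited |= mask;
	}

	return num_lags;
}
```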
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      528d3f19
    • net: mscc: ocelot: set up logical port IDs centrally · 2527f2e8
      Vladimir Oltean authored
      The setup of logical port IDs is done in two places: from the inconclusively
      named ocelot_setup_lag and from ocelot_port_lag_leave, a function that
      also calls ocelot_setup_lag (which apparently does an incomplete setup
      of the LAG).
      
      To improve this situation, we can rename ocelot_setup_lag into
      ocelot_setup_logical_port_ids, and drop the "lag" argument. It will now
      set up the logical port IDs of all switch ports, which may be slightly
      less efficient but is more maintainable.
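
      As an illustration of setting up the logical port IDs of all ports in one
      pass, consider the sketch below. It relies on the rule that a LAG member's
      logical port ID is the index of the first physical port in that LAG, and
      that an unbonded port is its own logical port. The integer bond
      identifiers and the function names are assumptions made for this example.

```c
#include <stdint.h>

/* Index of the lowest set bit of a non-zero mask. */
static int lowest_set_bit(uint32_t mask)
{
	int bit = 0;

	while (!(mask & 1)) {
		mask >>= 1;
		bit++;
	}

	return bit;
}

/*
 * Recompute the logical port ID of every port in one go.
 * bond_of[port] is a bond identifier, 0 meaning "not bonded".
 * Bonded ports share the index of the first member of their LAG;
 * unbonded ports use their own index.
 */
static void setup_logical_port_ids(const int *bond_of, int num_ports,
				   int *logical_id)
{
	if (num_ports > 32)
		num_ports = 32;

	for (int port = 0; port < num_ports; port++) {
		uint32_t bond_mask = 0;

		if (bond_of[port] == 0) {
			logical_id[port] = port;
			continue;
		}

		/* Bitwise OR of all ports sharing this port's bond. */
		for (int other = 0; other < num_ports; other++)
			if (bond_of[other] == bond_of[port])
				bond_mask |= UINT32_C(1) << other;

		logical_id[port] = lowest_set_bit(bond_mask);
	}
}
```

      Recomputing every port costs a nested loop over the ports, which matches
      the trade-off stated above: a little extra work in exchange for a single
      place where the IDs are derived.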
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2527f2e8
    • net: mscc: ocelot: avoid unneeded "lp" variable in LAG join · 2e9f4afa
      Vladimir Oltean authored
      The index of the LAG is equal to the logical port ID that all the
      physical port members have, which is further equal to the index of the
      first physical port that is a member of the LAG.
      
      The code gets a bit carried away with logic like this:
      
      	if (a == b)
      		c = a;
      	else
      		c = b;
      
      which can be simplified, of course, into:
      
      	c = b;
      
      (with a being port, b being lp, c being lag)
      
      This further makes the "lp" variable redundant, since we can use "lag"
      everywhere where "lp" (logical port) was used. So instead of a "c = b"
      assignment, we can do a complete deletion of b. Only one comment here:
      
      		if (bond_mask) {
      			lp = __ffs(bond_mask);
      			ocelot->lags[lp] = 0;
      		}
      
      lp was clobbered before, because it was used as a temporary variable to
      hold the new smallest port ID from the bond. Now that we don't have "lp"
      any longer, we'll just avoid the temporary variable and zeroize the
      bonding mask directly.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2e9f4afa
    • net: mscc: ocelot: set up the bonding mask in a way that avoids a net_device · b80af659
      Vladimir Oltean authored
      Since this code should be called from pure switchdev as well as from
      DSA, we must find a way to determine the bonding mask not by looking
      directly at the net_device lowers of the bonding interface, since those
      could have different private structures.
      
      We keep a pointer to the bonding upper interface, if present, in struct
      ocelot_port. Then the bonding mask becomes the bitwise OR of all ports
      that have the same bonding upper interface. This adds a duplication of
      functionality with the current "lags" array, but the duplication will be
      short-lived, since further patches will remove the latter completely.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b80af659
    • net: mscc: ocelot: use ipv6 in the aggregation code · f79c20c8
      Vladimir Oltean authored
      IPv6 header information is not currently part of the entropy source for
      the 4-bit aggregation code used for LAG offload, even though it could be.
      The hardware reference manual says about these fields:
      
      ANA::AGGR_CFG.AC_IP6_TCPUDP_PORT_ENA
      Use IPv6 TCP/UDP port when calculating aggregation code. Configure
      identically for all ports. Recommended value is 1.
      
      ANA::AGGR_CFG.AC_IP6_FLOW_LBL_ENA
      Use IPv6 flow label when calculating AC. Configure identically for all
      ports. Recommended value is 1.
      
      Integration with the xmit_hash_policy of the bonding interface is TBD.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f79c20c8
    • net: mscc: ocelot: don't refuse bonding interfaces we can't offload · 583cbbe3
      Vladimir Oltean authored
      Since switchdev/DSA exposes network interfaces that fulfill many of the
      same user space expectations that dedicated NICs do, it makes sense to
      not deny bonding interfaces with a bonding policy that we cannot offload,
      but instead allow the bonding driver to select the egress interface in
      software.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      583cbbe3
    • net: mscc: ocelot: use a switch-case statement in ocelot_netdevice_event · 41e66fa2
      Vladimir Oltean authored
      Streamline ocelot's net device event handler by structuring it
      similarly to other drivers' handlers. The inspiration here was
      dsa_slave_netdevice_event.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      41e66fa2