1. 12 Mar, 2021 40 commits
    • net/mlx5e: Allow to match on ICMP parameters · a3222a2d
      Maor Dickman authored
      Support matching on ICMPv4/6 type and code parameters using misc3
      section of match parameters.
      Signed-off-by: Maor Dickman <maord@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: CT: Add support for mirroring · 69e2916e
      Paul Blakey authored
      Add support for mirroring before the CT action by splitting the pre ct rule.
      Mirror outputs are done first on the tc chain,prio table rule (the fwd
      rule), which will then forward to a per port fwd table.
      On this fwd table, we insert the original pre ct rule that forwards to
      ct/ct nat table.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Maor Dickman <maord@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Display the command index in command mailbox dump · 287e0df0
      Alaa Hleihel authored
      Multiple commands can be printed at the same time which can
      lead to wrong order of their lines in dmesg output.
      As a result, it's hard to match data dumps to the correct command
      or to tell which command was fully dumped at some point.
      
      Fix this by displaying the corresponding command index, and also
      indicate when a command was fully dumped.
      Signed-off-by: Alaa Hleihel <alaa@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: allocate 'indirection_rqt' buffer dynamically · 2119bda6
      Arnd Bergmann authored
      Increasing the size of the indirection_rqt array from 128 to 256 bytes
      pushed the stack usage of the mlx5e_hairpin_fill_rqt_rqns() function
      over the warning limit when building with clang and CONFIG_KASAN:
      
      drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:970:1: error: stack frame size of 1180 bytes in function 'mlx5e_tc_add_nic_flow' [-Werror,-Wframe-larger-than=]
      
      Using dynamic allocation here is safe because the caller does the
      same, and it reduces the stack usage of the function to just a few
      bytes.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Dump ICOSQ WQE descriptor on CQE with error events · e16cf9d7
      Tariq Toukan authored
      Dump the ICOSQ's WQE descriptor when a completion with error is received.
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Use net_prefetchw instead of prefetchw in MPWQE TX datapath · 991b2654
      Maxim Mikityanskiy authored
      Commit e20f0dbf ("net/mlx5e: RX, Add a prefetch command for small
      L1_CACHE_BYTES") switched to using net_prefetchw at all places in mlx5e.
      In the same time frame, commit 5af75c74 ("net/mlx5e: Enhanced TX
      MPWQE for SKBs") added one more usage of prefetchw. When these two
      changes were merged, this new occurrence of prefetchw wasn't replaced
      with net_prefetchw.
      
      This commit fixes this last occurrence of prefetchw in
      mlx5e_tx_mpwqe_session_start, making the same change that was done in
      mlx5e_xdp_mpwqe_session_start.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Remove redundant newline in NL_SET_ERR_MSG_MOD · bca08a91
      Roi Dayan authored
      Fix the following coccicheck warnings:
      
      drivers/net/ethernet/mellanox/mlx5/core/devlink.c:145:29-66: WARNING
      avoid newline at end of message in NL_SET_ERR_MSG_MOD
      drivers/net/ethernet/mellanox/mlx5/core/devlink.c:140:29-77: WARNING
      avoid newline at end of message in NL_SET_ERR_MSG_MOD
      Signed-off-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Read congestion counters from all ports when lag is active · 093bd764
      Mark Zhang authored
      Read congestion counters from all ports in any lag mode rather than
      only in RoCE lag mode (e.g., VF lag).
      Signed-off-by: Mark Zhang <markzhang@nvidia.com>
      Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: remove unneeded semicolon · 79760922
      Jiapeng Chong authored
      Fix the following coccicheck warnings:
      
      ./drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c:495:2-3: Unneeded
      semicolon.
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: use kvfree() for memory allocated with kvzalloc() · ad2c99ca
      Junlin Yang authored
      The memory is allocated with kvzalloc(), so the corresponding release
      function should be kvfree(), not kfree().
      
      Generated by: scripts/coccinelle/api/kfree_mismatch.cocci
      Signed-off-by: Junlin Yang <yangjunlin@yulong.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: DR, Add missing vhca_id consume from STEv1 · cc82a2e6
      Yevgeny Kliteynik authored
      The field source_eswitch_owner_vhca_id was not consumed
      in the same way as in STEv0. Added the missing set.
      
      Fixes: 10b69418 ("net/mlx5: DR, Add HW STEv1 match logic")
      Signed-off-by: Alex Vesker <valex@mellanox.com>
      Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Reviewed-by: Alex Vesker <valex@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: DR, Remove unneeded rx_decap_l3 function for STEv1 · 14124778
      Yevgeny Kliteynik authored
      Remove the dr_ste_v1_set_rx_decap_l3 function that was
      replaced by another function - fixing a rebase error.
      Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Reviewed-by: Alex Vesker <valex@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: DR, Fixed typo in STE v0 · 0142f097
      Yevgeny Kliteynik authored
      "reforamt" -> "reformat"
      Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Reviewed-by: Alex Vesker <valex@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • docs: networking: phy: Improve placement of parenthesis · bfdfe7fc
      Jonathan Neuschäfer authored
      "either" is outside the parentheses, so the matching "or" should be too.
      Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-delayed-completions' · 5215206d
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: better deal with delayed TX completions
      
      Jakub and Neil reported an increase of RTO timers whenever
      TX completions are delayed a bit more (by increasing
      NIC TX coalescing parameters).

      While the problems have been there forever, the second patch might
      introduce some regressions, so I prefer not to backport
      them to stable releases before things settle.
      
      Many thanks to FB team for their help and tests.
      
      A few packetdrill tests need to be changed to reflect
      the improvements brought by this series.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove obsolete check in __tcp_retransmit_skb() · ac3959fd
      Eric Dumazet authored
      TSQ provides a nice way to avoid bufferbloat on an individual socket,
      including retransmit packets. We can get rid of the old
      heuristic:
      
      	/* Do not sent more than we queued. 1/4 is reserved for possible
      	 * copying overhead: fragmentation, tunneling, mangling etc.
      	 */
      	if (refcount_read(&sk->sk_wmem_alloc) >
      	    min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2),
      		  sk->sk_sndbuf))
      		return -EAGAIN;
      
      This heuristic was giving false positives according to Jakub,
      whenever TX completions are delayed above RTT. (Ack packets
      are processed by TCP stack before clones are orphaned/freed)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack() · a7abf3cd
      Eric Dumazet authored
      Jakub reported that data included in a Fastopen SYN that had to be
      retransmitted would have to wait for an RTO if TX completions are slow,
      even with the prior fix.

      This is because tcp_rcv_fastopen_synack() does not use standard
      rtx logic, meaning the TSQ handler exits early in tcp_tsq_write()
      because tp->lost_out == tp->retrans_out.

      Let's make tcp_rcv_fastopen_synack() use standard rtx logic,
      by using tcp_mark_skb_lost() on the skb that needs to be
      sent again.

      Note that this raised a warning in tcp_fastretrans_alert() during my
      tests, since we consider that data not being acknowledged
      by the receiver does not mean the packet was lost on the network.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: plug skb_still_in_host_queue() to TSQ · f4dae54e
      Eric Dumazet authored
      Jakub and Neil reported an increase of RTO timers whenever
      TX completions are delayed a bit more (by increasing
      NIC TX coalescing parameters).

      The main issue is that the TCP stack has logic preventing a packet
      from being retransmitted if the prior clone has not yet been
      orphaned or freed.

      This logic came with commit 1f3279ae ("tcp: avoid
      retransmits of TCP packets hanging in host queues").
      
      Thankfully, in the case skb_still_in_host_queue() detects
      the initial clone is still in flight, it can use TSQ logic
      that will eventually retry later, at the moment the clone
      is freed or orphaned.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Neil Spring <ntspring@fb.com>
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • isdn: remove extra spaces in the header file · 8176f8c0
      Tong Zhang authored
      Fix some coding style issues in the isdn header.
      Signed-off-by: Tong Zhang <ztong0001@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: clean up warnings detected by sparse · 97bc84bb
      Hoang Huu Le authored
      This patch fixes the following warnings from sparse:
      
      net/tipc/monitor.c:263:35: warning: incorrect type in assignment (different base types)
      net/tipc/monitor.c:263:35:    expected unsigned int
      net/tipc/monitor.c:263:35:    got restricted __be32 [usertype]
      [...]
      net/tipc/node.c:374:13: warning: context imbalance in 'tipc_node_read_lock' - wrong count at exit
      net/tipc/node.c:379:13: warning: context imbalance in 'tipc_node_read_unlock' - unexpected unlock
      net/tipc/node.c:384:13: warning: context imbalance in 'tipc_node_write_lock' - wrong count at exit
      net/tipc/node.c:389:13: warning: context imbalance in 'tipc_node_write_unlock_fast' - unexpected unlock
      net/tipc/node.c:404:17: warning: context imbalance in 'tipc_node_write_unlock' - unexpected unlock
      [...]
      net/tipc/crypto.c:1201:9: warning: incorrect type in initializer (different address spaces)
      net/tipc/crypto.c:1201:9:    expected struct tipc_aead [noderef] __rcu *__tmp
      net/tipc/crypto.c:1201:9:    got struct tipc_aead *
      [...]
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: convert dest node's address to network order · 1980d375
      Hoang Le authored
      (struct tipc_link_info)->dest is in network order (__be32), so we must
      convert the value to network order before assigning. The problem was
      detected by sparse:
      
      net/tipc/netlink_compat.c:699:24: warning: incorrect type in assignment (different base types)
      net/tipc/netlink_compat.c:699:24:    expected restricted __be32 [usertype] dest
      net/tipc/netlink_compat.c:699:24:    got int
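
      The difference between the two byte orders can be shown outside the
      kernel. The following is a minimal Python sketch, not the patch itself:
      socket.htonl() stands in for the kernel's htonl(), and the address value
      and helper names are hypothetical.

```python
import socket
import struct

# Arbitrary 32-bit value standing in for a node address (hypothetical).
NODE = 0x01001001


def to_wire(addr: int) -> bytes:
    """Pack a 32-bit value in network (big-endian) byte order."""
    return struct.pack("!I", addr)


def host_bytes(addr: int) -> bytes:
    """Pack a 32-bit value in the host's native byte order."""
    return struct.pack("=I", addr)


# Converting first with htonl() and then storing natively yields the same
# bytes as packing directly in network order, on any host endianness.
converted = socket.htonl(NODE)
print(host_bytes(converted).hex(), to_wire(NODE).hex())
```

      On a little-endian host, storing NODE without the conversion would
      produce the bytes in reverse order, which is the class of mistake the
      sparse annotation (__be32) is meant to catch.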
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlxsw-Implement-sampling-using-mirroring' · 1520929e
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Implement sampling using mirroring
      
      So far, sampling was implemented using a dedicated sampling mechanism
      that is available on all Spectrum ASICs. Spectrum-2 and later ASICs
      support sampling by mirroring packets to the CPU port with probability.
      This method has a couple of advantages compared to the legacy method:
      
      * Extra metadata per-packet: Egress port, egress traffic class, traffic
        class occupancy and end-to-end latency
      * Ability to sample packets on egress / per-flow as opposed to only
        ingress
      
      This series should not result in any user-visible changes and its aim is
      to convert Spectrum-2 and later ASICs to perform sampling by mirroring
      to the CPU port with probability. Future submissions will expose the
      additional metadata and enable sampling using more triggers (e.g.,
      egress).
      
      Series overview:
      
      Patches #1-#3 extend the SPAN (mirroring) module to accept new
      parameters required for sampling. See individual commit messages for
      detailed explanation.
      
      Patches #4-#5 split sampling support between Spectrum-1 and later ASICs
      while still using the legacy method for all ASIC generations.
      
      Patch #6 converts Spectrum-2 and later ASICs to perform sampling by
      mirroring to the CPU port with probability.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_matchall: Implement sampling using mirroring · cf31190a
      Ido Schimmel authored
      Spectrum-2 and later ASICs support sampling of packets by mirroring to
      the CPU with probability. There are several advantages compared to the
      legacy dedicated sampling mechanism:
      
      * Extra metadata per-packet: Egress port, egress traffic class, traffic
        class occupancy and end-to-end latency
      * Ability to sample packets on egress / per-flow
      
      Convert Spectrum-2 and later ASICs to perform sampling by mirroring to
      the CPU with probability.
      
      Subsequent patches will add support for egress / per-flow sampling and
      expose the extra metadata.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_trap: Split sampling traps between ASICs · 34a27721
      Ido Schimmel authored
      Sampling of ingress packets is supported using a dedicated sampling
      mechanism on all Spectrum ASICs. However, Spectrum-2 and later ASICs
      support more sophisticated sampling by mirroring packets to the CPU.
      
      As a preparation for more advanced sampling configurations, split the trap
      configuration used for sampled packets between Spectrum-1 and later ASICs.
      
      This is needed since packets that are mirrored to the CPU are trapped
      via a different trap identifier compared to packets that are sampled
      using the dedicated sampling mechanism.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_matchall: Split sampling support between ASICs · 20afb9bc
      Ido Schimmel authored
      Sampling of ingress packets is supported using a dedicated sampling
      mechanism on all Spectrum ASICs. However, Spectrum-2 and later ASICs
      support more sophisticated sampling by mirroring packets to the CPU.
      
      As a preparation for more advanced sampling configurations, split the
      sampling operations between Spectrum-1 and later ASICs.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_span: Add SPAN probability rate support · 2dcbd920
      Ido Schimmel authored
      Currently, every packet that matches a mirroring trigger (e.g., received
      packets, buffer dropped packets) is mirrored. Spectrum-2 and later ASICs
      support mirroring with probability, where every 1 in N matched packets
      is mirrored.
      
      Extend the API that creates the binding between the trigger and the SPAN
      agent with a probability rate parameter, which is an attribute of the
      trigger. Set it to '1' to maintain existing behavior.
      
      Subsequent patches will use it to perform more sophisticated sampling,
      by mirroring packets to the CPU with probability.
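
      The effect of the rate parameter can be modelled in a few lines. This is
      only a sketch under an assumption: the commit message does not say how
      the ASIC draws its sample, so a uniform random draw is used here, and
      the function name is hypothetical.

```python
import random


def should_mirror(rate: int, rng: random.Random) -> bool:
    """Decide whether a matched packet is mirrored, with probability 1/rate.

    Assumption: a uniform random draw; the hardware mechanism is not
    described in the commit message.
    """
    if rate < 1:
        raise ValueError("rate must be at least 1")
    # randrange(rate) is uniform over 0..rate-1, so it is 0 once in `rate`.
    return rng.randrange(rate) == 0


rng = random.Random(0)
mirrored_all = sum(should_mirror(1, rng) for _ in range(1000))
mirrored_some = sum(should_mirror(100, rng) for _ in range(100_000))
print(mirrored_all, mirrored_some)
```

      With a rate of 1 every matched packet is mirrored, which is why that
      value preserves the existing behavior; with a rate of 100 roughly one
      packet in a hundred is.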
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Extend mirroring registers with probability rate field · fa3faeb7
      Ido Schimmel authored
      The MPAR and MPAGR registers are used to configure the binding between
      the mirroring trigger (e.g., received packet) and the SPAN agent. Add
      probability rate field, which will allow us to support sampling by
      mirroring to the CPU.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_span: Add SPAN session identifier support · 5c7659eb
      Ido Schimmel authored
      When packets are mirrored to the CPU, the trap identifier with which the
      packets are trapped is determined according to the session identifier of
      the SPAN agent performing the mirroring. Packets that are trapped for
      the same logical reason (e.g., buffer drops) should use the same session
      identifier.
      
      Currently, a single session is implicitly supported (identifier 0) and
      is used for packets that are mirrored to the CPU due to buffer drops
      (e.g., early drop).
      
      Subsequent patches are going to mirror packets to the CPU due to
      sampling, which will require a different session identifier.
      
      Prepare for that by making the session identifier an attribute of the
      SPAN agent.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'mlx5-updates-2021-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 1bc61c9d
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      This series provides some cleanups to the mlx5 driver.
      For more information please see the tag log below.
      
      Please pull and let me know if there is any problem.
      
      mlx5-updates-2021-03-11
      
      Cleanups for mlx5 driver
      
      1) Fix build warnings from Arnd and Vlad
      2) Leon improves locking for driver load/unload flows
      3) From Roi, Lockdep false dependency warning
      4) Other trivial cleanups
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'nexthop-Resilient-next-hop-groups' · 2a0186a3
      David S. Miller authored
      Petr Machata says:
      
      ====================
      nexthop: Resilient next-hop groups
      
      At this moment, there is only one type of next-hop group: an mpath group.
      Mpath groups implement the hash-threshold algorithm, described in RFC
      2992[1].
      
      To select a next hop, the hash-threshold algorithm first assigns a range of
      hashes to each next hop in the group, and then selects the next hop by
      comparing the SKB hash with the individual ranges. When a next hop is
      removed from the group, the ranges are recomputed, which leads to
      reassignment of parts of hash space from one next hop to another. RFC 2992
      illustrates it thus:
      
                   +-------+-------+-------+-------+-------+
                   |   1   |   2   |   3   |   4   |   5   |
                   +-------+-+-----+---+---+-----+-+-------+
                   |    1    |    2    |    4    |    5    |
                   +---------+---------+---------+---------+
      
                    Before and after deletion of next hop 3
                    under the hash-threshold algorithm.
      
      Note how next hop 2 gave up part of the hash space in favor of next hop 1,
      and 4 in favor of 5. While there will usually be some overlap between the
      previous and the new distribution, some traffic flows change the next hop
      that they resolve to.
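
      The reassignment shown in the figure can be reproduced with a small
      model. This is a sketch, not the kernel code: it assumes equal weights
      and a toy hash space of 20 values, and all names are hypothetical.

```python
def threshold_ranges(hops, space):
    """Split [0, space) into equal contiguous regions, one per next hop."""
    n = len(hops)
    return [(hop, i * space // n, (i + 1) * space // n)
            for i, hop in enumerate(hops)]


def select(ranges, skb_hash):
    """Return the next hop whose region contains the hash."""
    for hop, low, high in ranges:
        if low <= skb_hash < high:
            return hop
    raise ValueError("hash outside of hash space")


SPACE = 20  # toy hash space (assumption, chosen so that the regions divide evenly)
before = threshold_ranges([1, 2, 3, 4, 5], SPACE)
after = threshold_ranges([1, 2, 4, 5], SPACE)

# Hash values that resolve to a different next hop once hop 3 is removed.
moved = [h for h in range(SPACE) if select(before, h) != select(after, h)]
print(moved)  # [4, 8, 9, 10, 11, 15]
```

      Hashes 8 to 11 must move because next hop 3 is gone, but hashes 4 and
      15 belonged to the surviving next hops 2 and 4 and are reassigned
      anyway, which is the disruption described above.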
      
      If a multipath group is used for load-balancing between multiple servers,
      this hash space reassignment causes an issue that packets from a single
      flow suddenly end up arriving at a server that does not expect them, which
      may lead to TCP reset.
      
      If a multipath group is used for load-balancing among available paths to
      the same server, the issue is that different latencies and reordering along
      the way cause the packets to arrive in the wrong order.
      
      Resilient hashing is a technique to address the above problem. A resilient
      next-hop group has another layer of indirection between the group itself
      and its constituent next hops: a hash table. The selection algorithm uses a
      straightforward modulo operation on the SKB hash to choose a hash table
      bucket, then reads the next hop that this bucket contains, and forwards
      traffic there.
      
      This indirection brings an important feature. In the hash-threshold
      algorithm, the range of hashes associated with a next hop must be
      continuous. With a hash table, mapping between the hash table buckets and
      the individual next hops is arbitrary. Therefore when a next hop is deleted
      the buckets that held it are simply reassigned to other next hops:
      
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                    v v v v
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      
                    Before and after deletion of next hop 3
                    under the resilient hashing algorithm.
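
      The bucket table in the figure can be modelled as follows. This is a
      sketch rather than the kernel implementation: the round-robin choice of
      replacement next hops is an assumption picked to match the figure, and
      all names are hypothetical.

```python
import itertools


def build_table(hops, buckets_per_hop):
    """Build a bucket table in which each hop owns a run of buckets."""
    return [hop for hop in hops for _ in range(buckets_per_hop)]


def select(table, skb_hash):
    """Pick a bucket by modulo on the hash and return the hop stored in it."""
    return table[skb_hash % len(table)]


def remove_hop(table, dead, survivors):
    """Reassign only the buckets that pointed at `dead`.

    Assumption: replacements are taken round-robin from `survivors`.
    """
    replacement = itertools.cycle(survivors)
    return [next(replacement) if hop == dead else hop for hop in table]


before = build_table([1, 2, 3, 4, 5], 4)
after = remove_hop(before, dead=3, survivors=[1, 2, 4, 5])

# Buckets whose content changed; every other hash keeps its next hop.
changed = [i for i, (old, new) in enumerate(zip(before, after)) if old != new]
print(changed)  # [8, 9, 10, 11]
```

      Only the four buckets that held next hop 3 change, so flows hashing to
      any other bucket continue to reach the same next hop.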
      
      When weights of next hops in a group are altered, it may be possible to
      choose a subset of buckets that are currently not used for forwarding
      traffic, and use those to satisfy the new next-hop distribution demands,
      keeping the "busy" buckets intact. This way, established flows are ideally
      kept being forwarded to the same endpoints through the same paths as before
      the next-hop group change.
      
      This patch set adds the implementation of resilient next-hop groups.
      
      In a nutshell, the algorithm works as follows. Each next hop has a number
      of buckets that it wants to have, according to its weight and the number of
      buckets in the hash table. In case of an event that might cause bucket
      allocation change, the numbers for individual next hops are updated,
      similarly to how ranges are updated for mpath group next hops. Following
      that, a new "upkeep" algorithm runs, and for idle buckets that belong to a
      next hop that is currently occupying more buckets than it wants (it is
      "overweight"), it migrates the buckets to one of the next hops that has
      fewer buckets than it wants (it is "underweight"). If, after this, there
      are still underweight next hops, another upkeep run is scheduled to a
      future time.
      
      Chances are there are not enough "idle" buckets to satisfy the new demands.
      The algorithm has knobs to select both what it means for a bucket to be
      idle, and whether and when to forcefully migrate buckets if the number of
      idle ones remains insufficient.
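
      The upkeep pass described above can be sketched as follows. This is a
      simplified model, not the kernel code: the rounding used to derive the
      wanted bucket counts, the reset of a bucket's idle time on migration,
      and all names are assumptions, and forced migration after the
      unbalanced timer is left out.

```python
from collections import Counter


def wanted_buckets(weights, num_buckets):
    """Split num_buckets among hops in proportion to their weights.

    Assumption: cumulative upper bounds rounded to the nearest integer.
    """
    total = sum(weights.values())
    wants, acc, prev_upper = {}, 0, 0
    for hop, weight in weights.items():
        acc += weight
        upper = (num_buckets * acc + total // 2) // total
        wants[hop] = upper - prev_upper
        prev_upper = upper
    return wants


def upkeep(table, idle_time, wants, idle_timer):
    """Run one upkeep pass in place; return True if a hop is still underweight."""
    counts = Counter(table)
    for i, hop in enumerate(table):
        if counts[hop] <= wants.get(hop, 0):
            continue  # the owner is not overweight
        if idle_time[i] < idle_timer:
            continue  # the bucket is busy, leave it alone
        under = next((h for h in wants if counts[h] < wants[h]), None)
        if under is None:
            break  # nobody is underweight any more
        counts[hop] -= 1
        counts[under] += 1
        table[i] = under
        idle_time[i] = 0.0  # assumption: migration restarts the idle clock
    return any(counts[h] < wants[h] for h in wants)


table = [1, 1, 1, 1, 2, 2, 2, 2]
idle = [0.0, 0.0, 0.0, 0.0, 90.0, 10.0, 90.0, 10.0]
wants = wanted_buckets({1: 3, 2: 1}, len(table))
still_unbalanced = upkeep(table, idle, wants, idle_timer=60.0)
print(wants, table, still_unbalanced)
```

      In this run next hop 1 wants six buckets and next hop 2 wants two, and
      the two buckets of next hop 2 that were idle for 90 seconds are handed
      over while its busy buckets stay untouched. Had all of them been busy,
      the pass would return True, corresponding to another upkeep run being
      scheduled.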
      
      To illustrate the usage, consider the following commands:
      
       # ip nexthop add id 1 via 192.0.2.2 dev dummy1
       # ip nexthop add id 2 via 192.0.2.3 dev dummy1
       # ip nexthop add id 10 group 1/2 type resilient \
      	buckets 8 idle_timer 60 unbalanced_timer 300
      
      The last command creates a resilient next-hop group. It will have 8
      buckets, each bucket will be considered idle when no traffic hits it for at
      least 60 seconds, and if the table remains out of balance for 300 seconds,
      it will be forcefully brought into balance.
      
      If not present in the netlink message, the idle timer defaults to 120
      seconds, and there is no unbalanced timer, meaning the group may remain
      unbalanced indefinitely. The value of 120 is the default in the Cumulus
      implementation of resilient next-hop groups. To a degree the default is
      arbitrary; the only value that certainly does not make sense is 0.
      Therefore going with an existing deployed implementation is reasonable.
      
      Unbalanced time, i.e. how long since the last time that all nexthops had as
      many buckets as they should according to their weights, is reported when
      the group is dumped:
      
       # ip nexthop show id 10
       id 10 group 1/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0
      
      When replacing next hops or changing weights, if one does not specify some
      parameters, their value is left as it was:
      
       # ip nexthop replace id 10 group 1,2/2 type resilient
       # ip nexthop show id 10
       id 10 group 1,2/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0
      
      It is also possible to do a dump of individual buckets (and now you know
      why there were only 8 of them in the example above):
      
       # ip nexthop bucket show id 10
       id 10 index 0 idle_time 5.59 nhid 1
       id 10 index 1 idle_time 5.59 nhid 1
       id 10 index 2 idle_time 8.74 nhid 2
       id 10 index 3 idle_time 8.74 nhid 2
       id 10 index 4 idle_time 8.74 nhid 1
       id 10 index 5 idle_time 8.74 nhid 1
       id 10 index 6 idle_time 8.74 nhid 1
       id 10 index 7 idle_time 8.74 nhid 1
      
      Note the two buckets that have a shorter idle time. Those are the ones that
      were migrated after the nexthop replace command to satisfy the new demand
      that nexthop 1 be given 6 buckets instead of 4.
      
      The patchset proceeds as follows:
      
      - Patches #1 and #2 are small refactoring patches.
      
      - Patch #3 adds a new flag to struct nh_group, is_multipath. This flag is
        meant to be set for all nexthop groups that in general have several
        nexthops from which they choose, and avoids a more expensive dispatch
        based on reading several flags, one for each nexthop group type.
      
      - Patch #4 contains defines of new UAPI attributes and the new next-hop
        group type. At this point, the nexthop code is made to bounce the new
        type. As the resilient hashing code is gradually added in the following
        patch sets, it will remain dead. The last patch will make it accessible.
      
        This patch also adds a suite of new messages related to next hop buckets.
        This approach was taken instead of overloading the information on the
        existing RTM_{NEW,DEL,GET}NEXTHOP messages for the following reasons.
      
        First, a next-hop group can contain a large number of next-hop buckets
        (4k is not unheard of). This imposes limits on the amount of information
        that can be encoded for each next-hop bucket given a netlink message is
        limited to 64k bytes.
      
        Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at this
        point, in the future it can be extended to provide user space with
        control over next-hop buckets configuration.
      
      - Patch #5 contains the meat of the resilient next-hop group support.
      
      - Patches #6 and #7 implement support for notifications towards the
        drivers.
      
      - Patch #8 adds an interface for the drivers to report resilient hash
        table bucket activity. Drivers will be able to report through this
        interface whether traffic is hitting a given bucket.
      
      - Patch #9 adds an interface for the drivers to report whether a given
        hash table bucket is offloaded or trapping traffic.
      
      - In patches #10, #11, #12 and #13, UAPI is implemented. This includes all
        the code necessary for creation of resilient groups, bucket dumping and
        getting, and bucket migration notifications.
      
      - In patch #14 the next-hop groups are finally made available.
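
      The message-size argument made for patch #4 can be checked with a line of
      arithmetic. The sketch below is only an illustration: the function name is
      made up, and it ignores netlink and attribute headers, so the result is an
      upper bound on the per-bucket payload.

```python
NETLINK_MSG_LIMIT = 64 * 1024  # message size limit cited in the text, in bytes
NUM_BUCKETS = 4 * 1024         # "4k is not unheard of"


def per_bucket_budget(msg_limit: int, num_buckets: int) -> int:
    """Upper bound on bytes available per bucket if all buckets share one message."""
    return msg_limit // num_buckets


print(per_bucket_budget(NETLINK_MSG_LIMIT, NUM_BUCKETS))
```

      With so few bytes per bucket, carrying full bucket state inside the
      existing nexthop messages would leave little room for growth.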
      
      The overall plan is to contribute approximately the following patchsets:
      
      1) Nexthop policy refactoring (already pushed)
      2) Preparations for resilient next-hop groups (already pushed)
      3) Implementation of resilient next-hop groups (this patchset)
      4) Netdevsim offload plus a suite of selftests
      5) Preparations for mlxsw offload of resilient next-hop groups
      6) mlxsw offload including selftests
      
      Interested parties can look at the current state of the code at [2] and
      [3].
      
      [1] https://tools.ietf.org/html/rfc2992
      [2] https://github.com/idosch/linux/commits/submit/res_integ_v1
      [3] https://github.com/idosch/iproute2/commits/submit/res_v1
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2a0186a3
    • nexthop: Enable resilient next-hop groups · 15e1dd57
      Petr Machata authored
      Now that all the code is in place, stop rejecting requests to create
      resilient next-hop groups.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15e1dd57
    • nexthop: Notify userspace about bucket migrations · 0b4818aa
      Petr Machata authored
      Nexthop replacements et al. are notified through netlink, but if delayed
      work migrates buckets in the background, userspace will stay oblivious.
      Notify these as RTM_NEWNEXTHOPBUCKET events.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b4818aa
    • nexthop: Add netlink handlers for bucket get · 187d4c6b
      Petr Machata authored
      Allow getting (but not setting) individual buckets to inspect the next hop
      mapped therein, idle time, and flags.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      187d4c6b
    • nexthop: Add netlink handlers for bucket dump · 8a1bbabb
      Petr Machata authored
      Add a dump handler for resilient next-hop buckets. When a next-hop group
      ID is given, it walks the buckets of that group; otherwise it walks the
      buckets of all groups. It then dumps the buckets whose next hops match the
      given filtering criteria.
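
      The walk-then-filter behaviour described above can be sketched as follows.
      This is a toy model, not the kernel handler: the Bucket type, the
      dump_buckets() name, and the choice of a next-hop ID as the only filter
      are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class Bucket:
    group_id: int
    index: int
    nhid: int


def dump_buckets(groups: Dict[int, List[Bucket]],
                 group_id: Optional[int] = None,
                 nhid: Optional[int] = None) -> Iterator[Bucket]:
    """Walk one group if group_id is given, else all groups; filter by next hop."""
    if group_id is not None:
        walked = [groups.get(group_id, [])]
    else:
        walked = [groups[gid] for gid in sorted(groups)]
    for buckets in walked:
        for bucket in buckets:
            # Hypothetical filter: keep the bucket if no next-hop ID was
            # requested or if the bucket currently holds that next hop.
            if nhid is None or bucket.nhid == nhid:
                yield bucket


groups = {
    10: [Bucket(10, 0, 1), Bucket(10, 1, 2)],
    20: [Bucket(20, 0, 2)],
}
for b in dump_buckets(groups, nhid=2):
    print(b.group_id, b.index, b.nhid)
```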
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a1bbabb
    • nexthop: Add netlink handlers for resilient nexthop groups · a2601e2b
      Petr Machata authored
      Implement the netlink messages that allow creation and dumping of resilient
      nexthop groups.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2601e2b
    • nexthop: Allow reporting activity of nexthop buckets · cfc15c1d
      Ido Schimmel authored
      The kernel periodically checks the idle time of nexthop buckets to
      determine if they are idle and can be re-populated with a new nexthop.
      
      When the resilient nexthop group is offloaded to hardware, the kernel
      will not see activity on nexthop buckets unless it is reported from
      hardware.
      
      Add a function that can be periodically called by device drivers to
      report activity on nexthop buckets after querying it from the underlying
      device.
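
      A minimal model of this reporting path, assuming the driver hands over one
      activity flag per bucket and the core refreshes a per-bucket "last used"
      timestamp. The function and variable names are hypothetical and do not
      reflect the kernel's actual signature.

```python
from typing import List


def report_activity(last_used: List[float], activity: List[bool], now: float) -> None:
    """Refresh the last-used time of every bucket the device reports as active."""
    for index, active in enumerate(activity):
        if active:
            last_used[index] = now


def idle_time(last_used: List[float], index: int, now: float) -> float:
    """How long a bucket has gone without observed traffic."""
    return now - last_used[index]


last_used = [0.0, 0.0, 0.0, 0.0]
# The driver queried the device at t=10 and saw traffic on buckets 1 and 3.
report_activity(last_used, [False, True, False, True], now=10.0)
print([idle_time(last_used, i, now=12.0) for i in range(4)])
```

      Without such reports, every offloaded bucket would look idle to the
      periodic check and would be eligible for migration.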
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfc15c1d
    • nexthop: Allow setting "offload" and "trap" indication of nexthop buckets · 56ad5ba3
      Ido Schimmel authored
      Add a function that can be called by device drivers to set "offload" or
      "trap" indication on nexthop buckets following nexthop notifications and
      other changes such as a neighbour becoming invalid.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      56ad5ba3
    • nexthop: Implement notifiers for resilient nexthop groups · 7c37c7e0
      Petr Machata authored
      Implement the following notifications towards drivers:
      
      - NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created.
      
      - NEXTHOP_EVENT_BUCKET_REPLACE any time there is a change in assignment of
        next hops to hash table buckets. That includes replacements, deletions,
        and delayed upkeep cycles. Some bucket notifications can be vetoed by the
        driver, to make it possible to propagate bucket busy-ness flags from the
        HW back to the algorithm. Some are however forced, e.g. if a next hop is
        deleted, all buckets that use this next hop simply must be migrated,
        whether the HW wishes so or not.
      
      - NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is
        replaced. Usually the driver will get the bucket notifications as well,
        and could veto those. But in some cases, a bucket may not be migrated
        immediately, but during delayed upkeep, and that is too late to roll the
        transaction back. This notification allows the driver to take a look and
        veto the new proposed group up front, before anything is committed.
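
      The veto semantics of these notifications can be modelled roughly as
      below. This is a sketch under stated assumptions: the Event enum mirrors
      the event names above, while notify(), the listener signature, and the
      "force" flag are invented for illustration and are not the kernel's
      notifier API.

```python
from enum import Enum, auto
from typing import Callable, Dict, List


class Event(Enum):
    REPLACE = auto()
    BUCKET_REPLACE = auto()
    RES_TABLE_PRE_REPLACE = auto()


# A listener returns True to accept the change and False to veto it.
Listener = Callable[[Event, Dict], bool]


def notify(listeners: List[Listener], event: Event, info: Dict,
           force: bool = False) -> bool:
    """Return True if the change may proceed. A forced change ignores vetoes."""
    for listener in listeners:
        accepted = listener(event, info)
        if not accepted and not force:
            return False
    return True


busy_in_hw = {3}  # buckets the hypothetical device still sees traffic on


def driver(event: Event, info: Dict) -> bool:
    # Veto migration of a bucket that the hardware reports as busy.
    if event is Event.BUCKET_REPLACE and info["index"] in busy_in_hw:
        return False
    return True


print(notify([driver], Event.BUCKET_REPLACE, {"index": 3}))
print(notify([driver], Event.BUCKET_REPLACE, {"index": 3}, force=True))
```

      The forced case corresponds to a next hop being deleted: the bucket has to
      move regardless of what the hardware reports.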
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c37c7e0
    • nexthop: Add data structures for resilient group notifications · b8f090d0
      Ido Schimmel authored
      Add data structures that will be used for in-kernel notifications about
      addition / deletion of a resilient nexthop group and about changes to a
      hash bucket within a resilient group.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8f090d0
    • nexthop: Add implementation of resilient next-hop groups · 283a72a5
      Petr Machata authored
      At this moment, there is only one type of next-hop group: an mpath group,
      which implements the hash-threshold algorithm.
      
      To select a next hop, the hash-threshold algorithm first assigns a range
      of hashes to each next hop in the group, and then selects the next hop by
      comparing the SKB hash with the individual ranges. When a next hop is
      removed from the group, the ranges are recomputed, which leads to
      reassignment of parts of hash space from one next hop to another. While
      there will usually be some overlap between the previous and the new
      distribution, some traffic flows change the next hop that they resolve to.
      That causes problems, e.g. established TCP connections are reset because
      the traffic is forwarded to a server that is not familiar with the
      connection.
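
      The disruption described above can be reproduced with a toy version of
      hash-threshold selection. The sketch assumes a tiny hash space of 12
      values and equal weights; build_ranges() and select() are illustrative
      names, not kernel functions.

```python
from typing import Dict, List, Tuple


def build_ranges(weights: Dict[str, int], space: int) -> List[Tuple[str, int]]:
    """Give each next hop a contiguous hash range; store its exclusive upper bound."""
    total = sum(weights.values())
    bounds = []
    acc = 0
    for nh, weight in weights.items():
        acc += weight
        bounds.append((nh, acc * space // total))
    return bounds


def select(bounds: List[Tuple[str, int]], flow_hash: int) -> str:
    """Pick the first next hop whose range contains the hash."""
    for nh, upper in bounds:
        if flow_hash < upper:
            return nh
    raise ValueError("hash outside of hash space")


SPACE = 12
before = build_ranges({"A": 1, "B": 1, "C": 1}, SPACE)
after = build_ranges({"B": 1, "C": 1}, SPACE)  # next hop A removed

# Flows that were NOT on the removed next hop, yet resolve differently now.
disrupted = [h for h in range(SPACE)
             if select(before, h) != "A" and select(before, h) != select(after, h)]
print(disrupted)
```

      The flows listed in disrupted were being served by B, which is still in
      the group, but they land on C after the ranges are recomputed.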
      
      Resilient hashing is a technique to address the above problem. A resilient
      next-hop group has another layer of indirection between the group itself
      and its constituent next hops: a hash table. The selection algorithm uses a
      straightforward modulo operation to choose a hash bucket, then reads the
      next hop that this bucket contains and forwards traffic there.
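
      In code, the selection step amounts to one modulo and one table lookup.
      The following sketch uses a plain list as the bucket table; the names and
      the table contents are assumptions for illustration.

```python
from typing import List


def select_bucket(flow_hash: int, num_buckets: int) -> int:
    """Choose a bucket index with a straightforward modulo."""
    return flow_hash % num_buckets


def select_next_hop(flow_hash: int, table: List[str]) -> str:
    """Read the next hop stored in the chosen bucket."""
    return table[select_bucket(flow_hash, len(table))]


table = ["A", "B", "A", "B"]
print(select_next_hop(5, table))

# Changing what one bucket holds affects only hashes that map to that bucket.
table[1] = "C"
print(select_next_hop(5, table), select_next_hop(6, table))
```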
      
      This indirection brings an important feature. In the hash-threshold
      algorithm, the range of hashes associated with a next hop must be
      contiguous. With a hash table, the mapping between the hash table buckets
      and the individual next hops is arbitrary. Therefore, when a next hop is
      deleted, the buckets that held it are simply reassigned to other next hops.
      When weights of next hops in a group are altered, it may be possible to
      choose a subset of buckets that are currently not used for forwarding
      traffic, and use those to satisfy the new next-hop distribution demands,
      keeping the "busy" buckets intact. This way, established flows ideally keep
      being forwarded to the same endpoints through the same paths as before the
      next-hop group change.
      
      In a nutshell, the algorithm works as follows. Each next hop has a number
      of buckets that it wants to have, according to its weight and the number of
      buckets in the hash table. In case of an event that might cause bucket
      allocation change, the numbers for individual next hops are updated,
      similarly to how ranges are updated for mpath group next hops. Following
      that, a new "upkeep" algorithm runs, and for idle buckets that belong to a
      next hop that is currently occupying more buckets than it wants (it is
      "overweight"), it migrates the buckets to one of the next hops that has
      fewer buckets than it wants (it is "underweight"). If, after this, there
      are still underweight next hops, another upkeep run is scheduled for a
      future time.
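
      A simplified model of one upkeep pass is sketched below. It is not the
      kernel implementation: the function names, the integer-division rule for
      the wanted counts, the idle_timer parameter, and the weights 3:1 are all
      assumptions. Migration refreshes the bucket's last-used time, which is in
      line with migrated buckets showing a shorter idle time in the dump.

```python
from collections import Counter
from typing import Dict, List


def wanted_buckets(weights: Dict[int, int], num_buckets: int) -> Dict[int, int]:
    """Split the table among next hops in proportion to their weights."""
    total = sum(weights.values())
    wants, acc, prev = {}, 0, 0
    for nh, weight in weights.items():
        acc += weight
        upper = acc * num_buckets // total
        wants[nh] = upper - prev
        prev = upper
    return wants


def upkeep(table: List[int], last_used: List[float], wants: Dict[int, int],
           now: float, idle_timer: float) -> bool:
    """Move idle buckets from overweight to underweight next hops.

    Returns True if some next hop is still underweight, i.e. another
    upkeep run would need to be scheduled.
    """
    counts = Counter(table)
    for index, nh in enumerate(table):
        if counts[nh] <= wants.get(nh, 0):
            continue  # owner is not overweight
        if now - last_used[index] < idle_timer:
            continue  # bucket is busy, leave it alone
        target = next((c for c in wants if counts[c] < wants[c]), None)
        if target is None:
            break  # nobody is underweight any more
        table[index] = target
        counts[nh] -= 1
        counts[target] += 1
        last_used[index] = now
    return any(counts[c] < wants[c] for c in wants)


wants = wanted_buckets({1: 3, 2: 1}, num_buckets=8)
table = [1, 1, 1, 1, 2, 2, 2, 2]
last_used = [0.0, 0.0, 0.0, 0.0, 9.5, 9.5, 0.0, 0.0]  # buckets 4 and 5 are busy
again = upkeep(table, last_used, wants, now=10.0, idle_timer=5.0)
print(wants, table, again)
```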
      
      Chances are there are not enough "idle" buckets to satisfy the new demands.
      The algorithm has knobs to select both what it means for a bucket to be
      idle, and whether and when to forcefully migrate buckets if the number of
      idle buckets remains insufficient.
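
      How two such knobs might combine is sketched below. The parameter names
      idle_timer and unbalanced_timer, and the convention that an
      unbalanced_timer of 0 disables forced migration, are assumptions of this
      sketch rather than something the text above specifies.

```python
def may_migrate(bucket_idle_for: float, group_unbalanced_for: float,
                idle_timer: float, unbalanced_timer: float) -> bool:
    """Decide whether a bucket owned by an overweight next hop may be moved."""
    if bucket_idle_for >= idle_timer:
        return True  # bucket counts as idle
    if unbalanced_timer > 0 and group_unbalanced_for >= unbalanced_timer:
        return True  # group has been unbalanced for too long: force it
    return False


# Busy bucket, group only briefly unbalanced: keep the flow where it is.
print(may_migrate(1.0, 30.0, idle_timer=120.0, unbalanced_timer=300.0))
# Busy bucket, but the group has been unbalanced past the limit: migrate anyway.
print(may_migrate(1.0, 301.0, idle_timer=120.0, unbalanced_timer=300.0))
```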
      
      There are three users of the resilient data structures.
      
      - The forwarding code accesses them under RCU, and does not modify them
        except for updating the time a selected bucket was last used.
      
      - Netlink code, running under RTNL, which may modify the data.
      
      - The delayed upkeep code, which may modify the data. This runs unlocked,
        and mutual exclusion between the RTNL code and the delayed upkeep is
        maintained by canceling the delayed work synchronously before the RTNL
        code touches anything. Later it restarts the delayed work if necessary.
      
      The RTNL code has to implement next-hop group replacement, next hop
      removal, etc. For removal, the mpath code uses a neat trick of having a
      backup next hop group structure, doing the necessary changes offline, and
      then RCU-swapping them in. However, the hash tables for resilient hashing
      are about an order of magnitude larger than the groups themselves (the size
      might be e.g. 4K entries), and it was felt that keeping two of them would
      be overkill. Both the primary next-hop group and the spare therefore use the
      same resilient table, and writers are careful to keep all references valid
      for the forwarding code. The hash table references next-hop group entries
      from the next-hop group that is currently in the primary role (i.e. not
      spare). During the transition from primary to spare, the table references a
      mix of both the primary group and the spare. When a next hop is deleted,
      the corresponding buckets are not set to NULL, but instead marked as empty,
      so that the pointer is valid and can be used by the forwarding code. The
      buckets are then migrated to a new next-hop group entry during upkeep. The
      only times that the hash table is invalid are the very beginning and very
      end of its lifetime. Between those points, it is always kept valid.
      
      This patch introduces the core support code itself. It does not handle
      notifications towards drivers, which are kept as if the group were an mpath
      one. It does not handle netlink either. The only bit currently exposed to
      user space is the new next-hop group type, and that is currently bounced.
      There is therefore no way to actually access this code.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      283a72a5