Commit cab9572a authored by David S. Miller's avatar David S. Miller

Merge branch 'mlxsw-Further-MC-awareness-configuration'

Ido Schimmel says:

====================
mlxsw: Further MC-awareness configuration

Petr says:

Due to an issue in Spectrum chips, when unicast traffic shares the same
queue as BUM traffic, and there is congestion, the BUM traffic is
admitted to the queue anyway, thus pushing out all UC traffic. In order
to give unicast traffic precedence over BUM traffic, multicast-aware
mode is now configured on all ports. Under MC-aware mode, egress TCs
8..15 are used for BUM traffic, which has its own dedicated pool.

This patch set improves the way that the MC pool and the higher-order
TCs are integrated into the system.

In patch #1, shaper at the higher TCs is configured to the same value
that it has by default. It's better to have the corresponding artifact
in the code explicitly.

The 8 following patches gradually extend the devlink handling in mlxsw
to support the extra TCs and the new MC pool.

Patch #2 changes the way that pools are indexed in mlxsw. Instead of
using (FW index, direction) tuple to identify the pool and the
associated cache, mlxsw now uses devlink index. This change is necessary
because the new pool 15 is not contiguously adjacent to the
currently-used pools 0..3, and because it's only relevant on egress.
Using devlink index relaxes the requirement for symmetry and adjacency
imposed by using FW indexing.

In patch #3, the assumption that number of ingress TCs matches that of
egress TCs is relaxed to allow exposition of egress TCs 8..15.

In patches #4, #5 and #6, support for infinite quotas is introduced.
Infinite quotas are reported as taking all the memory in the system, but
actually use a mechanism where the infinity is configured explicitly.

In patches #7 and #8, support for configuring static pool sizes in
introduced. Statically-sized pools have been supported for a while now,
but during initialization, all pools have dynamic size. The patches
allow there to be a mix of by-default static and dynamic pools.

In patches #9 and #10, pool 15 resp. per-priority MC quotas are
explicitly configured to be in sync with the current recommendation for
handling BUM traffic in Spectrum chips.

In the following 3 patches, an mlxsw-specific selftest is added to test
the MC-awareness configuration.

First in patches #11 and #12, lib.sh is extended with functions to
collect ethtool stats, and to manage port MTU.

Then in patch #13 the selftest itself is added.
====================
Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
parents faa08325 b5638d46
......@@ -8336,8 +8336,15 @@ MLXSW_ITEM32(reg, sbpr, dir, 0x00, 24, 2);
*/
MLXSW_ITEM32(reg, sbpr, pool, 0x00, 0, 4);
/* reg_sbpr_infi_size
* Size is infinite.
* Access: RW
*/
MLXSW_ITEM32(reg, sbpr, infi_size, 0x04, 31, 1);
/* reg_sbpr_size
* Pool size in buffer cells.
* Reserved when infi_size = 1.
* Access: RW
*/
MLXSW_ITEM32(reg, sbpr, size, 0x04, 0, 24);
......@@ -8355,13 +8362,15 @@ MLXSW_ITEM32(reg, sbpr, mode, 0x08, 0, 4);
static inline void mlxsw_reg_sbpr_pack(char *payload, u8 pool,
enum mlxsw_reg_sbxx_dir dir,
enum mlxsw_reg_sbpr_mode mode, u32 size)
enum mlxsw_reg_sbpr_mode mode, u32 size,
bool infi_size)
{
MLXSW_REG_ZERO(sbpr, payload);
mlxsw_reg_sbpr_pool_set(payload, pool);
mlxsw_reg_sbpr_dir_set(payload, dir);
mlxsw_reg_sbpr_mode_set(payload, mode);
mlxsw_reg_sbpr_size_set(payload, size);
mlxsw_reg_sbpr_infi_size_set(payload, infi_size);
}
/* SBCM - Shared Buffer Class Management Register
......@@ -8409,6 +8418,12 @@ MLXSW_ITEM32(reg, sbcm, min_buff, 0x18, 0, 24);
#define MLXSW_REG_SBXX_DYN_MAX_BUFF_MIN 1
#define MLXSW_REG_SBXX_DYN_MAX_BUFF_MAX 14
/* reg_sbcm_infi_max
* Max buffer is infinite.
* Access: RW
*/
MLXSW_ITEM32(reg, sbcm, infi_max, 0x1C, 31, 1);
/* reg_sbcm_max_buff
* When the pool associated to the port-pg/tclass is configured to
* static, Maximum buffer size for the limiter configured in cells.
......@@ -8418,6 +8433,7 @@ MLXSW_ITEM32(reg, sbcm, min_buff, 0x18, 0, 24);
* 0: 0
* i: (1/128)*2^(i-1), for i=1..14
* 0xFF: Infinity
* Reserved when infi_max = 1.
* Access: RW
*/
MLXSW_ITEM32(reg, sbcm, max_buff, 0x1C, 0, 24);
......@@ -8430,7 +8446,8 @@ MLXSW_ITEM32(reg, sbcm, pool, 0x24, 0, 4);
static inline void mlxsw_reg_sbcm_pack(char *payload, u8 local_port, u8 pg_buff,
enum mlxsw_reg_sbxx_dir dir,
u32 min_buff, u32 max_buff, u8 pool)
u32 min_buff, u32 max_buff,
bool infi_max, u8 pool)
{
MLXSW_REG_ZERO(sbcm, payload);
mlxsw_reg_sbcm_local_port_set(payload, local_port);
......@@ -8438,6 +8455,7 @@ static inline void mlxsw_reg_sbcm_pack(char *payload, u8 local_port, u8 pg_buff,
mlxsw_reg_sbcm_dir_set(payload, dir);
mlxsw_reg_sbcm_min_buff_set(payload, min_buff);
mlxsw_reg_sbcm_max_buff_set(payload, max_buff);
mlxsw_reg_sbcm_infi_max_set(payload, infi_max);
mlxsw_reg_sbcm_pool_set(payload, pool);
}
......
......@@ -2804,6 +2804,13 @@ static int mlxsw_sp_port_ets_init(struct mlxsw_sp_port *mlxsw_sp_port)
MLXSW_REG_QEEC_MAS_DIS);
if (err)
return err;
err = mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
MLXSW_REG_QEEC_HIERARCY_TC,
i + 8, i,
MLXSW_REG_QEEC_MAS_DIS);
if (err)
return err;
}
/* Map all priorities to traffic class 0. */
......
#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
#
# A test for switch behavior under MC overload. An issue in Spectrum chips
# causes throughput of UC traffic to drop severely when a switch is under heavy
# MC load. This issue can be overcome by putting the switch to MC-aware mode.
# This test verifies that UC performance stays intact even as the switch is
# under MC flood, and therefore that the MC-aware mode is enabled and correctly
# configured.
#
# Because mlxsw throttles CPU port, the traffic can't actually reach userspace
# at full speed. That makes it impossible to use iperf3 to simply measure the
# throughput, because many packets (that reach $h3) don't get to the kernel at
# all even in UDP mode (the situation is even worse in TCP mode, where one can't
# hope to see more than a couple Mbps).
#
# So instead we send traffic with mausezahn and use RX ethtool counters at $h3.
# Multicast traffic is untagged, unicast traffic is tagged with PCP 1. Therefore
# each gets a different priority and we can use per-prio ethtool counters to
# measure the throughput. In order to avoid prioritizing unicast traffic, prio
# qdisc is installed on $swp3 and maps all priorities to the same band #7 (and
# thus TC 0).
#
# Mausezahn can't actually saturate the links unless it's using large frames.
# Thus we set MTU to 10K on all involved interfaces. Then both unicast and
# multicast traffic uses 8K frames.
#
# +-----------------------+ +----------------------------------+
# | H1 | | H2 |
# | | | unicast --> + $h2.111 |
# | | | traffic | 192.0.2.129/28 |
# | multicast | | | e-qos-map 0:1 |
# | traffic | | | |
# | $h1 + <----- | | + $h2 |
# +-----|-----------------+ +--------------|-------------------+
# | |
# +-----|-------------------------------------------------|-------------------+
# | + $swp1 + $swp2 |
# | | >1Gbps | >1Gbps |
# | +---|----------------+ +----------|----------------+ |
# | | + $swp1.1 | | + $swp2.111 | |
# | | BR1 | SW | BR111 | |
# | | + $swp3.1 | | + $swp3.111 | |
# | +---|----------------+ +----------|----------------+ |
# | \_________________________________________________/ |
# | | |
# | + $swp3 |
# | | 1Gbps bottleneck |
# | | prio qdisc: {0..7} -> 7 |
# +------------------------------------|--------------------------------------+
# |
# +--|-----------------+
# | + $h3 H3 |
# | | |
# | + $h3.111 |
# | 192.0.2.130/28 |
# +--------------------+
ALL_TESTS="
ping_ipv4
test_mc_aware
"
lib_dir=$(dirname $0)/../../../net/forwarding
NUM_NETIFS=6
source $lib_dir/lib.sh
h1_create()
{
simple_if_init $h1
mtu_set $h1 10000
}
h1_destroy()
{
mtu_restore $h1
simple_if_fini $h1
}
h2_create()
{
simple_if_init $h2
mtu_set $h2 10000
vlan_create $h2 111 v$h2 192.0.2.129/28
ip link set dev $h2.111 type vlan egress-qos-map 0:1
}
h2_destroy()
{
vlan_destroy $h2 111
mtu_restore $h2
simple_if_fini $h2
}
h3_create()
{
simple_if_init $h3
mtu_set $h3 10000
vlan_create $h3 111 v$h3 192.0.2.130/28
}
h3_destroy()
{
vlan_destroy $h3 111
mtu_restore $h3
simple_if_fini $h3
}
switch_create()
{
ip link set dev $swp1 up
mtu_set $swp1 10000
ip link set dev $swp2 up
mtu_set $swp2 10000
ip link set dev $swp3 up
mtu_set $swp3 10000
vlan_create $swp2 111
vlan_create $swp3 111
ethtool -s $swp3 speed 1000 autoneg off
tc qdisc replace dev $swp3 root handle 3: \
prio bands 8 priomap 7 7 7 7 7 7 7 7
ip link add name br1 type bridge vlan_filtering 0
ip link set dev br1 up
ip link set dev $swp1 master br1
ip link set dev $swp3 master br1
ip link add name br111 type bridge vlan_filtering 0
ip link set dev br111 up
ip link set dev $swp2.111 master br111
ip link set dev $swp3.111 master br111
}
switch_destroy()
{
ip link del dev br111
ip link del dev br1
tc qdisc del dev $swp3 root handle 3:
ethtool -s $swp3 autoneg on
vlan_destroy $swp3 111
vlan_destroy $swp2 111
mtu_restore $swp3
ip link set dev $swp3 down
mtu_restore $swp2
ip link set dev $swp2 down
mtu_restore $swp1
ip link set dev $swp1 down
}
setup_prepare()
{
h1=${NETIFS[p1]}
swp1=${NETIFS[p2]}
swp2=${NETIFS[p3]}
h2=${NETIFS[p4]}
swp3=${NETIFS[p5]}
h3=${NETIFS[p6]}
h3mac=$(mac_get $h3)
vrf_prepare
h1_create
h2_create
h3_create
switch_create
}
cleanup()
{
pre_cleanup
switch_destroy
h3_destroy
h2_destroy
h1_destroy
vrf_cleanup
}
ping_ipv4()
{
ping_test $h2 192.0.2.130
}
humanize()
{
local speed=$1; shift
for unit in bps Kbps Mbps Gbps; do
if (($(echo "$speed < 1024" | bc))); then
break
fi
speed=$(echo "scale=1; $speed / 1024" | bc)
done
echo "$speed${unit}"
}
rate()
{
local t0=$1; shift
local t1=$1; shift
local interval=$1; shift
echo $((8 * (t1 - t0) / interval))
}
check_rate()
{
local rate=$1; shift
local min=$1; shift
local what=$1; shift
if ((rate > min)); then
return 0
fi
echo "$what $(humanize $ir) < $(humanize $min_ingress)" > /dev/stderr
return 1
}
measure_uc_rate()
{
local what=$1; shift
local interval=10
local i
local ret=0
# Dips in performance might cause momentary ingress rate to drop below
# 1Gbps. That wouldn't saturate egress and MC would thus get through,
# seemingly winning bandwidth on account of UC. Demand at least 2Gbps
# average ingress rate to somewhat mitigate this.
local min_ingress=2147483648
mausezahn $h2.111 -p 8000 -A 192.0.2.129 -B 192.0.2.130 -c 0 \
-a own -b $h3mac -t udp -q &
sleep 1
for i in {5..0}; do
local t0=$(ethtool_stats_get $h3 rx_octets_prio_1)
local u0=$(ethtool_stats_get $swp2 rx_octets_prio_1)
sleep $interval
local t1=$(ethtool_stats_get $h3 rx_octets_prio_1)
local u1=$(ethtool_stats_get $swp2 rx_octets_prio_1)
local ir=$(rate $u0 $u1 $interval)
local er=$(rate $t0 $t1 $interval)
if check_rate $ir $min_ingress "$what ingress rate"; then
break
fi
# Fail the test if we can't get the throughput.
if ((i == 0)); then
ret=1
fi
done
# Suppress noise from killing mausezahn.
{ kill %% && wait; } 2>/dev/null
echo $ir $er
exit $ret
}
test_mc_aware()
{
RET=0
local -a uc_rate
uc_rate=($(measure_uc_rate "UC-only"))
check_err $? "Could not get high enough UC-only ingress rate"
local ucth1=${uc_rate[1]}
mausezahn $h1 -p 8000 -c 0 -a own -b bc -t udp -q &
local d0=$(date +%s)
local t0=$(ethtool_stats_get $h3 rx_octets_prio_0)
local u0=$(ethtool_stats_get $swp1 rx_octets_prio_0)
local -a uc_rate_2
uc_rate_2=($(measure_uc_rate "UC+MC"))
check_err $? "Could not get high enough UC+MC ingress rate"
local ucth2=${uc_rate_2[1]}
local d1=$(date +%s)
local t1=$(ethtool_stats_get $h3 rx_octets_prio_0)
local u1=$(ethtool_stats_get $swp1 rx_octets_prio_0)
local deg=$(bc <<< "
scale=2
ret = 100 * ($ucth1 - $ucth2) / $ucth1
if (ret > 0) { ret } else { 0 }
")
check_err $(bc <<< "$deg > 10")
local interval=$((d1 - d0))
local mc_ir=$(rate $u0 $u1 $interval)
local mc_er=$(rate $t0 $t1 $interval)
# Suppress noise from killing mausezahn.
{ kill %% && wait; } 2>/dev/null
log_test "UC performace under MC overload"
echo "UC-only throughput $(humanize $ucth1)"
echo "UC+MC throughput $(humanize $ucth2)"
echo "Degradation $deg %"
echo
echo "Full report:"
echo " UC only:"
echo " ingress UC throughput $(humanize ${uc_rate[0]})"
echo " egress UC throughput $(humanize ${uc_rate[1]})"
echo " UC+MC:"
echo " ingress UC throughput $(humanize ${uc_rate_2[0]})"
echo " egress UC throughput $(humanize ${uc_rate_2[1]})"
echo " ingress MC throughput $(humanize $mc_ir)"
echo " egress MC throughput $(humanize $mc_er)"
}
trap cleanup EXIT
setup_prepare
setup_wait
tests_run
exit $EXIT_STATUS
......@@ -494,6 +494,14 @@ tc_rule_stats_get()
| jq '.[1].options.actions[].stats.packets'
}
ethtool_stats_get()
{
local dev=$1; shift
local stat=$1; shift
ethtool -S $dev | grep "^ *$stat:" | head -n 1 | cut -d: -f2
}
mac_get()
{
local if_name=$1
......@@ -541,6 +549,23 @@ forwarding_restore()
sysctl_restore net.ipv4.conf.all.forwarding
}
declare -A MTU_ORIG
mtu_set()
{
local dev=$1; shift
local mtu=$1; shift
MTU_ORIG["$dev"]=$(ip -j link show dev $dev | jq -e '.[].mtu')
ip link set dev $dev mtu $mtu
}
mtu_restore()
{
local dev=$1; shift
ip link set dev $dev mtu ${MTU_ORIG["$dev"]}
}
tc_offload_check()
{
local num_netifs=${1:-$NUM_NETIFS}
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment