Commit 8150f0cf authored by Jakub Kicinski

Merge branch 'bridge-mcast-extensions-for-evpn'

Ido Schimmel says:

====================
bridge: mcast: Extensions for EVPN

tl;dr
=====

This patchset creates feature parity between user space and the kernel
and allows the former to install and replace MDB port group entries with
a source list and associated filter mode. This is required for EVPN use
cases where multicast state is not derived from snooped IGMP/MLD
packets, but instead derived from EVPN routes exchanged by the control
plane in user space.

Background
==========

IGMPv3 [1] and MLDv2 [2] differ from earlier versions of the protocols
in that they add support for source-specific multicast. That is, hosts
can advertise interest in listening to a particular multicast address
only from specific source addresses or from all sources except for
specific source addresses.

In kernel 5.10 [3][4], the bridge driver gained the ability to snoop
IGMPv3/MLDv2 packets and install corresponding MDB port group entries.
For example, a snooped IGMPv3 Membership Report that contains a single
MODE_IS_EXCLUDE record for group 239.10.10.10 with sources 192.0.2.1,
192.0.2.2, 192.0.2.20 and 192.0.2.21 would trigger the creation of these
entries:

 # bridge -d mdb show
 dev br0 port veth1 grp 239.10.10.10 src 192.0.2.21 temp filter_mode include proto kernel  blocked
 dev br0 port veth1 grp 239.10.10.10 src 192.0.2.20 temp filter_mode include proto kernel  blocked
 dev br0 port veth1 grp 239.10.10.10 src 192.0.2.2 temp filter_mode include proto kernel  blocked
 dev br0 port veth1 grp 239.10.10.10 src 192.0.2.1 temp filter_mode include proto kernel  blocked
 dev br0 port veth1 grp 239.10.10.10 temp filter_mode exclude source_list 192.0.2.21/0.00,192.0.2.20/0.00,192.0.2.2/0.00,192.0.2.1/0.00 proto kernel
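
The effect of the entries above on forwarding can be sketched in C. This is
an illustration only, not the bridge driver's code: every name prefixed with
ex_ is hypothetical, sources are plain IPv4 addresses in host byte order, and
the sketch assumes that every source listed in EXCLUDE mode has an expired
source timer (the "/0.00" values in the output above), so that it is blocked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the bridge driver's own. */
enum ex_filter_mode { EX_MODE_INCLUDE, EX_MODE_EXCLUDE };

struct ex_group_state {
	enum ex_filter_mode mode;
	const uint32_t *sources;	/* IPv4 sources, host byte order */
	size_t num_sources;
};

/* Linear search of the source list. */
static bool ex_source_listed(const struct ex_group_state *g, uint32_t src)
{
	for (size_t i = 0; i < g->num_sources; i++)
		if (g->sources[i] == src)
			return true;
	return false;
}

/*
 * INCLUDE: forward only traffic from listed sources.
 * EXCLUDE: forward traffic from every source except the listed ones
 * (assumption: all listed sources have expired timers).
 */
bool ex_should_forward(const struct ex_group_state *g, uint32_t src)
{
	bool listed;

	assert(g != NULL);
	listed = ex_source_listed(g, src);
	return g->mode == EX_MODE_INCLUDE ? listed : !listed;
}
```

With the EXCLUDE entry shown above, a call for source 192.0.2.1 would return
false (the (S, G) entry is marked blocked), while a call for any unlisted
source would return true.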

While the kernel can install and replace entries with a filter mode and
source list, user space cannot. It can only add EXCLUDE entries with an
empty source list, which is sufficient for IGMPv2/MLDv1, but not for
IGMPv3/MLDv2.

Use cases where the multicast state is not derived from snooped packets,
but instead derived from routes exchanged by the user space control
plane require feature parity between user space and the kernel in terms
of MDB configuration. Such a use case is detailed in the next section.

Motivation
==========

RFC 7432 [5] defines a "MAC/IP Advertisement route" (type 2) [6] that
allows NVE switches in the EVPN network to advertise and learn
reachability information for unicast MAC addresses. Traffic destined to
a unicast MAC address can therefore be selectively forwarded to a single
NVE switch behind which the MAC is located.

The same is not true for IP multicast traffic. Such traffic is simply
flooded as BUM to all NVE switches in the broadcast domain (BD),
regardless of whether a switch has interested receivers for the
multicast stream. This is especially problematic for overlay networks
that make heavy use of multicast.

The issue is addressed by RFC 9251 [7] that defines a "Selective
Multicast Ethernet Tag Route" (type 6) [8] which allows NVE switches in
the EVPN network to advertise multicast streams that they are interested
in. This is done by having each switch suppress IGMP/MLD packets from
being transmitted to the NVE network and instead communicate the
information over BGP to other switches.

As far as the bridge driver is concerned, the above means that the
multicast state (i.e., {multicast address, group timer, filter-mode,
(source records)}) for the VXLAN bridge port is not populated by the
kernel from snooped IGMP/MLD packets (they are suppressed), but instead
by user space. Specifically, by the routing daemon that is exchanging
EVPN routes with other NVE switches.

Changes are obviously also required in the VXLAN driver, but they are
the subject of future patchsets. See the "Future work" section.

Implementation
==============

The user interface is extended to allow user space to specify the filter
mode of the MDB port group entry and its source list. Replace support is
also added so that user space would not need to remove an entry and
re-add it only to edit its source list or filter mode, as that would
result in packet loss. Example usage:

 # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent \
	source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra
 # bridge -d -s mdb show
 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra  blocked    0.00
 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra  blocked    0.00
 dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra     0.00

The netlink interface is extended with a few new attributes in the
RTM_NEWMDB request message:

[ struct nlmsghdr ]
[ struct br_port_msg ]
[ MDBA_SET_ENTRY ]
	struct br_mdb_entry
[ MDBA_SET_ENTRY_ATTRS ]
	[ MDBE_ATTR_SOURCE ]
		struct in_addr / struct in6_addr
	[ MDBE_ATTR_SRC_LIST ]		// new
		[ MDBE_SRC_LIST_ENTRY ]
			[ MDBE_SRCATTR_ADDRESS ]
				struct in_addr / struct in6_addr
		[ ...]
	[ MDBE_ATTR_GROUP_MODE ]	// new
		u8
	[ MDBE_ATTR_RTPROT ]		// new
		u8

No changes are required in RTM_NEWMDB responses and notifications, as
all the information can already be dumped by the kernel today.
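
As a rough sketch of the nesting shown above, the following C code lays out
the MDBE_ATTR_SRC_LIST and MDBE_ATTR_GROUP_MODE attributes in a flat buffer.
It is not taken from iproute2 or the kernel. The EX_ constants are local
mirrors of the enum values added by this merge, the buffer type and helper
functions are hypothetical, setting the NLA_F_NESTED bit on nested attributes
is an assumption, and the filter mode byte is passed through unchanged. The
netlink header, struct br_port_msg and MDBA_SET_ENTRY are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the enum values added in this merge. */
#define EX_MDBE_ATTR_SRC_LIST	2
#define EX_MDBE_ATTR_GROUP_MODE	3
#define EX_MDBE_SRC_LIST_ENTRY	1
#define EX_MDBE_SRCATTR_ADDRESS	1

#define EX_NLA_F_NESTED		0x8000u	/* same bit as NLA_F_NESTED */
#define EX_NLA_HDRLEN		4u	/* u16 length + u16 type */
#define EX_NLA_ALIGN(n)		(((n) + 3u) & ~3u)

/* Hypothetical fixed-size buffer for the attribute payload. */
struct ex_buf {
	uint8_t data[256];
	size_t len;
};

/* Append one attribute, padded to 4 bytes; return the header offset. */
static size_t ex_put(struct ex_buf *b, uint16_t type,
		     const void *payload, uint16_t plen)
{
	size_t off = b->len;
	uint16_t nla_len = (uint16_t)(EX_NLA_HDRLEN + plen);

	assert(off + EX_NLA_ALIGN(nla_len) <= sizeof(b->data));
	memset(b->data + off, 0, EX_NLA_ALIGN(nla_len));
	memcpy(b->data + off, &nla_len, sizeof(nla_len));
	memcpy(b->data + off + 2, &type, sizeof(type));
	if (plen)
		memcpy(b->data + off + EX_NLA_HDRLEN, payload, plen);
	b->len = off + EX_NLA_ALIGN(nla_len);
	return off;
}

/* Close a nest: its length covers everything appended since it opened. */
static void ex_nest_end(struct ex_buf *b, size_t nest_off)
{
	uint16_t nla_len = (uint16_t)(b->len - nest_off);

	memcpy(b->data + nest_off, &nla_len, sizeof(nla_len));
}

/* Source list (IPv4, network byte order) followed by the filter mode. */
size_t ex_build_mdb_attrs(struct ex_buf *b, const uint32_t *srcs_be,
			  size_t n, uint8_t mode)
{
	size_t list = ex_put(b, EX_MDBE_ATTR_SRC_LIST | EX_NLA_F_NESTED,
			     NULL, 0);

	for (size_t i = 0; i < n; i++) {
		size_t entry = ex_put(b, EX_MDBE_SRC_LIST_ENTRY |
				      EX_NLA_F_NESTED, NULL, 0);

		ex_put(b, EX_MDBE_SRCATTR_ADDRESS, &srcs_be[i],
		       sizeof(srcs_be[i]));
		ex_nest_end(b, entry);
	}
	ex_nest_end(b, list);
	ex_put(b, EX_MDBE_ATTR_GROUP_MODE, &mode, sizeof(mode));
	return b->len;
}
```

Each source costs 12 bytes in this layout: a 4 byte entry header plus an
8 byte address attribute. A real request would place these attributes inside
MDBA_SET_ENTRY_ATTRS, after the headers shown in the diagram above.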

Testing
=======

Tested with existing bridge multicast selftests: bridge_igmp.sh,
bridge_mdb_port_down.sh, bridge_mdb.sh, bridge_mld.sh,
bridge_vlan_mcast.sh.

In addition, added many new test cases for existing as well as for new
MDB functionality.

Patchset overview
=================

Patches #1-#8 are non-functional preparations for the core changes in
later patches.

Patches #9-#10 allow user space to install (*, G) entries with a source
list and associated filter mode. Specifically, patch #9 adds the
necessary kernel plumbing and patch #10 exposes the new functionality to
user space via a few new attributes.

Patch #11 allows user space to specify the routing protocol of new MDB
port group entries so that a routing daemon could differentiate between
entries installed by it and those installed by an administrator.

Patch #12 allows user space to replace MDB port group entries. This is
useful, for example, when user space wants to add a new source to a
source list. Instead of deleting a (*, G) entry and re-adding it with an
extended source list (which would result in packet loss), user space can
simply replace the current entry.

Patches #13-#14 add tests for existing MDB functionality as well as for
all new functionality added in this patchset.

Future work
===========

The VXLAN driver will need to be extended with an MDB so that it could
selectively forward IP multicast traffic to NVE switches with interested
receivers instead of simply flooding it to all switches as BUM.

The idea is to reuse the existing MDB interface for the VXLAN driver in
a similar way to how the FDB interface is shared between the bridge and
VXLAN drivers.

From a command-line perspective, configuration will look as follows:

 # bridge mdb add dev br0 port vxlan0 grp 239.1.1.1 permanent \
	filter_mode exclude source_list 198.50.100.1,198.50.100.2

 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent \
	filter_mode include source_list 198.50.100.3,198.50.100.4 \
	dst 192.0.2.1 dst_port 4789 src_vni 2

 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent \
	filter_mode exclude source_list 198.50.100.1,198.50.100.2 \
	dst 192.0.2.2 dst_port 4789 src_vni 2

Where the first command is enabled by this set, but the next two will be
the subject of future work.

From a netlink perspective, the existing PF_BRIDGE/RTM_*MDB messages will
be extended to the VXLAN driver. This means that a few new attributes
will be added (e.g., 'MDBE_ATTR_SRC_VNI') and that the handlers for
these messages will need to move to net/core/rtnetlink.c. The rtnetlink
code will call into the appropriate driver based on the ifindex
specified in the ancillary header.

iproute2 patches can be found here [9].

Changelog
=========

Since v1 [10]:

* Patch #12: Remove extack from br_mdb_replace_group_sg().
* Patch #12: Change 'nlflags' to u16 and move it after 'filter_mode' to
  pack the structure.

Since RFC [11]:

* Patch #6: New patch.
* Patch #9: Use an array instead of a list to store source entries.
* Patch #10: Use an array instead of a list to store source entries.
* Patch #10: Drop br_mdb_config_attrs_fini().
* Patch #11: Reject protocol for host entries.
* Patch #13: New patch.
* Patch #14: New patch.

[1] https://datatracker.ietf.org/doc/html/rfc3376
[2] https://www.rfc-editor.org/rfc/rfc3810
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6af52ae2ed14a6bc756d5606b29097dfd76740b8
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=68d4fd30c83b1b208e08c954cd45e6474b148c87
[5] https://datatracker.ietf.org/doc/html/rfc7432
[6] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2
[7] https://datatracker.ietf.org/doc/html/rfc9251
[8] https://datatracker.ietf.org/doc/html/rfc9251#section-9.1
[9] https://github.com/idosch/iproute2/commits/submit/mdb_v1
[10] https://lore.kernel.org/netdev/20221208152839.1016350-1-idosch@nvidia.com/
[11] https://lore.kernel.org/netdev/20221018120420.561846-1-idosch@nvidia.com/
====================

Link: https://lore.kernel.org/r/20221210145633.1328511-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parents 02abf84a b6d00da0
@@ -723,10 +723,31 @@ enum {
 enum {
 	MDBE_ATTR_UNSPEC,
 	MDBE_ATTR_SOURCE,
+	MDBE_ATTR_SRC_LIST,
+	MDBE_ATTR_GROUP_MODE,
+	MDBE_ATTR_RTPROT,
 	__MDBE_ATTR_MAX,
 };
 #define MDBE_ATTR_MAX (__MDBE_ATTR_MAX - 1)
 
+/* per mdb entry source */
+enum {
+	MDBE_SRC_LIST_UNSPEC,
+	MDBE_SRC_LIST_ENTRY,
+	__MDBE_SRC_LIST_MAX,
+};
+#define MDBE_SRC_LIST_MAX (__MDBE_SRC_LIST_MAX - 1)
+
+/* per mdb entry per source attributes
+ * these are embedded in MDBE_SRC_LIST_ENTRY
+ */
+enum {
+	MDBE_SRCATTR_UNSPEC,
+	MDBE_SRCATTR_ADDRESS,
+	__MDBE_SRCATTR_MAX,
+};
+#define MDBE_SRCATTR_MAX (__MDBE_SRCATTR_MAX - 1)
+
 /* Embedded inside LINK_XSTATS_TYPE_BRIDGE */
 enum {
 	BRIDGE_XSTATS_UNSPEC,
......
This diff is collapsed.
@@ -552,7 +552,8 @@ static void br_multicast_fwd_src_remove(struct net_bridge_group_src *src,
 			continue;
 
 		if (p->rt_protocol != RTPROT_KERNEL &&
-		    (p->flags & MDB_PG_FLAGS_PERMANENT))
+		    (p->flags & MDB_PG_FLAGS_PERMANENT) &&
+		    !(src->flags & BR_SGRP_F_USER_ADDED))
 			break;
 
 		if (fastleave)
@@ -650,18 +651,23 @@ static void br_multicast_destroy_group_src(struct net_bridge_mcast_gc *gc)
 	kfree_rcu(src, rcu);
 }
 
-void br_multicast_del_group_src(struct net_bridge_group_src *src,
-				bool fastleave)
+void __br_multicast_del_group_src(struct net_bridge_group_src *src)
 {
 	struct net_bridge *br = src->pg->key.port->br;
 
-	br_multicast_fwd_src_remove(src, fastleave);
 	hlist_del_init_rcu(&src->node);
 	src->pg->src_ents--;
 	hlist_add_head(&src->mcast_gc.gc_node, &br->mcast_gc_list);
 	queue_work(system_long_wq, &br->mcast_gc_work);
 }
 
+void br_multicast_del_group_src(struct net_bridge_group_src *src,
+				bool fastleave)
+{
+	br_multicast_fwd_src_remove(src, fastleave);
+	__br_multicast_del_group_src(src);
+}
+
 static void br_multicast_destroy_port_group(struct net_bridge_mcast_gc *gc)
 {
 	struct net_bridge_port_group *pg;
@@ -1232,7 +1238,7 @@ br_multicast_find_group_src(struct net_bridge_port_group *pg, struct br_ip *ip)
 	return NULL;
 }
 
-static struct net_bridge_group_src *
+struct net_bridge_group_src *
 br_multicast_new_group_src(struct net_bridge_port_group *pg, struct br_ip *src_ip)
 {
 	struct net_bridge_group_src *grp_src;
......
@@ -93,11 +93,21 @@ struct bridge_mcast_stats {
 	struct u64_stats_sync syncp;
 };
 
+struct br_mdb_src_entry {
+	struct br_ip			addr;
+};
+
 struct br_mdb_config {
 	struct net_bridge		*br;
 	struct net_bridge_port		*p;
 	struct br_mdb_entry		*entry;
 	struct br_ip			group;
+	bool				src_entry;
+	u8				filter_mode;
+	u16				nlflags;
+	struct br_mdb_src_entry		*src_entries;
+	int				num_src_entries;
+	u8				rt_protocol;
 };
 #endif
 
@@ -300,6 +310,7 @@ struct net_bridge_fdb_flush_desc {
 #define BR_SGRP_F_DELETE	BIT(0)
 #define BR_SGRP_F_SEND		BIT(1)
 #define BR_SGRP_F_INSTALLED	BIT(2)
+#define BR_SGRP_F_USER_ADDED	BIT(3)
 
 struct net_bridge_mcast_gc {
 	struct hlist_node		gc_node;
@@ -974,6 +985,10 @@ void br_multicast_sg_add_exclude_ports(struct net_bridge_mdb_entry *star_mp,
 				       struct net_bridge_port_group *sg);
 struct net_bridge_group_src *
 br_multicast_find_group_src(struct net_bridge_port_group *pg, struct br_ip *ip);
+struct net_bridge_group_src *
+br_multicast_new_group_src(struct net_bridge_port_group *pg,
+			   struct br_ip *src_ip);
+void __br_multicast_del_group_src(struct net_bridge_group_src *src);
 void br_multicast_del_group_src(struct net_bridge_group_src *src,
 				bool fastleave);
 void br_multicast_ctx_init(struct net_bridge *br,
......
@@ -3,6 +3,7 @@
 TEST_PROGS = bridge_igmp.sh \
 	bridge_locked_port.sh \
 	bridge_mdb.sh \
+	bridge_mdb_host.sh \
 	bridge_mdb_port_down.sh \
 	bridge_mld.sh \
 	bridge_port_isolation.sh \
......
#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
#
# Verify that adding host mdb entries work as intended for all types of
# multicast filters: ipv4, ipv6, and mac

ALL_TESTS="mdb_add_del_test"
NUM_NETIFS=2

TEST_GROUP_IP4="225.1.2.3"
TEST_GROUP_IP6="ff02::42"
TEST_GROUP_MAC="01:00:01:c0:ff:ee"

source lib.sh

h1_create()
{
	simple_if_init $h1 192.0.2.1/24 2001:db8:1::1/64
}

h1_destroy()
{
	simple_if_fini $h1 192.0.2.1/24 2001:db8:1::1/64
}

switch_create()
{
	# Enable multicast filtering
	ip link add dev br0 type bridge mcast_snooping 1

	ip link set dev $swp1 master br0

	ip link set dev br0 up
	ip link set dev $swp1 up
}

switch_destroy()
{
	ip link set dev $swp1 down
	ip link del dev br0
}

setup_prepare()
{
	h1=${NETIFS[p1]}
	swp1=${NETIFS[p2]}

	vrf_prepare

	h1_create
	switch_create
}

cleanup()
{
	pre_cleanup

	switch_destroy
	h1_destroy

	vrf_cleanup
}

do_mdb_add_del()
{
	local group=$1
	local flag=$2

	RET=0
	bridge mdb add dev br0 port br0 grp $group $flag 2>/dev/null
	check_err $? "Failed adding $group to br0, port br0"

	if [ -z "$flag" ]; then
		flag="temp"
	fi

	bridge mdb show dev br0 | grep $group | grep -q $flag 2>/dev/null
	check_err $? "$group not added with $flag flag"

	bridge mdb del dev br0 port br0 grp $group 2>/dev/null
	check_err $? "Failed deleting $group from br0, port br0"

	bridge mdb show dev br0 | grep -q $group >/dev/null
	check_err_fail 1 $? "$group still in mdb after delete"

	log_test "MDB add/del group $group to bridge port br0"
}

mdb_add_del_test()
{
	do_mdb_add_del $TEST_GROUP_MAC permanent
	do_mdb_add_del $TEST_GROUP_IP4
	do_mdb_add_del $TEST_GROUP_IP6
}

trap cleanup EXIT

setup_prepare
setup_wait

tests_run

exit $EXIT_STATUS