Commit 67d25ce8 authored by Jakub Kicinski

Merge branch 'nexthop-preparations-for-resilient-next-hop-groups'

Petr Machata says:

====================
nexthop: Preparations for resilient next-hop groups

At this moment, there is only one type of next-hop group: an mpath group.
Mpath groups implement the hash-threshold algorithm, described in RFC
2992[1].

To select a next hop, the hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus:

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
              under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.
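
As an illustration of the reassignment described above, here is a minimal
sketch in C. It is not the kernel implementation: it assumes equal weights
and splits the 32-bit hash space into n equal regions, whereas the kernel
keeps a per-entry upper_bound. The function name hash_threshold_pick() and
the plain int next-hop IDs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not kernel code: pick a next hop by hash threshold.
 * The 32-bit hash space is split into n equal, contiguous regions, and
 * region i belongs to hops[i] (equal weights are assumed).
 */
static int hash_threshold_pick(const int *hops, size_t n, uint32_t hash)
{
	assert(hops != NULL && n > 0);

	/* Scale the hash into [0, n) using 64-bit math to avoid overflow. */
	return hops[((uint64_t)hash * n) >> 32];
}
```

With hops {1, 2, 3, 4, 5}, hash 0x38000000 lands in the region of next hop
2. After next hop 3 is removed and the regions are recomputed over
{1, 2, 4, 5}, the same hash lands in the region of next hop 1, so that flow
changes its next hop even though next hop 2 is still present.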

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment means that packets from a single flow can
suddenly arrive at a server that does not expect them, which may lead to a
TCP reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that differing latencies and reordering along
the way cause the packets to arrive in the wrong order.

Resilient hashing is a technique to address the above problems. A resilient
next-hop group has another layer of indirection between the group itself
and its constituent next hops: a hash table. The selection algorithm uses a
straightforward modulo operation to choose a hash bucket, then reads the
next hop that this bucket contains and forwards traffic there.
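
The lookup just described can be sketched as follows. This is only an
illustration under stated assumptions: the struct res_table layout, the
fixed NUM_BUCKETS of 20 and the res_select() name are hypothetical and do
not come from this patchset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 20	/* illustrative table size, an assumption */

/* Hypothetical bucket table: each bucket stores a next-hop ID. */
struct res_table {
	int bucket[NUM_BUCKETS];
};

/* Select a bucket with a modulo operation and return the next hop that
 * the bucket currently holds.
 */
static int res_select(const struct res_table *t, uint32_t hash)
{
	assert(t != NULL);
	return t->bucket[hash % NUM_BUCKETS];
}
```

Because the flow hash only chooses a bucket, changing which next hop a
bucket holds does not affect flows that hash to any other bucket.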

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
contiguous. With a hash table, the mapping between the hash table buckets
and the individual next hops is arbitrary. Therefore, when a next hop is
deleted, the buckets that held it are simply reassigned to other next hops:

             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                              v v v v
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              Before and after deletion of next hop 3
              under the resilient hashing algorithm.
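
The figure above can be reproduced with a short sketch. The round-robin
replacement policy and the res_remove_nh() name are assumptions made for
illustration; the cover letter does not say how replacement next hops are
chosen.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rewrite every bucket that holds `dead` with one of
 * the surviving next hops, taken in round-robin order. Buckets that hold
 * other next hops are left untouched. Returns the number of buckets that
 * were rewritten.
 */
static size_t res_remove_nh(int *buckets, size_t n_buckets, int dead,
			    const int *alive, size_t n_alive)
{
	size_t i, next = 0, moved = 0;

	assert(buckets != NULL && alive != NULL && n_alive > 0);

	for (i = 0; i < n_buckets; i++) {
		if (buckets[i] != dead)
			continue;
		buckets[i] = alive[next];
		next = (next + 1) % n_alive;
		moved++;
	}
	return moved;
}
```

Starting from four buckets per next hop and removing next hop 3 with
survivors {1, 2, 4, 5}, only the four buckets that held next hop 3 change,
and they end up holding 1, 2, 4 and 5, as in the figure.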

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows ideally
continue to be forwarded to the same endpoints through the same paths as
before the next-hop group change.
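
One way to picture this is the sketch below. It assumes that each bucket
carries a busy flag maintained elsewhere (for example from a last-used
timestamp) and that the new weights have already been translated into a
wanted bucket count per next hop. The names struct bucket, rebalance_idle()
and MAX_NH are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NH 8	/* illustrative upper limit, an assumption */

struct bucket {
	int nh;		/* index of the next hop this bucket points to */
	bool busy;	/* assumption: set elsewhere when recently used */
};

/* Hypothetical helper: migrate idle buckets from next hops that hold more
 * buckets than wanted[] allows to next hops that hold fewer. Busy buckets
 * are never touched, so the result may fall short of the target.
 * Returns the number of buckets migrated.
 */
static size_t rebalance_idle(struct bucket *b, size_t n_buckets,
			     const size_t *wanted, size_t n_nh)
{
	size_t have[MAX_NH] = {0};
	size_t i, to, moved = 0;

	assert(b != NULL && wanted != NULL && n_nh > 0 && n_nh <= MAX_NH);

	/* Count how many buckets each next hop currently holds. */
	for (i = 0; i < n_buckets; i++) {
		assert(b[i].nh >= 0 && (size_t)b[i].nh < n_nh);
		have[b[i].nh]++;
	}

	for (i = 0; i < n_buckets; i++) {
		size_t from = (size_t)b[i].nh;

		if (b[i].busy || have[from] <= wanted[from])
			continue;

		/* Find a next hop that still needs buckets. */
		for (to = 0; to < n_nh; to++)
			if (have[to] < wanted[to])
				break;
		if (to == n_nh)
			break;	/* every next hop is satisfied */

		b[i].nh = (int)to;
		have[from]--;
		have[to]++;
		moved++;
	}
	return moved;
}
```

With eight buckets split evenly between two next hops, two busy buckets on
the first next hop and a new target of two and six buckets, only the two
idle buckets of the first next hop are migrated and the busy ones keep
their next hop.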

This patchset prepares the next-hop code for eventual introduction of
resilient hashing groups.

- Patches #1-#4 carry otherwise disjoint changes that just remove certain
  assumptions in the next-hop code.

- Patches #5-#6 extend the in-kernel next-hop notifiers to support more
  next-hop group types.

- Patches #7-#12 refactor RTNL message handlers. Resilient next-hop groups
  will introduce a new logical object, a hash table bucket. It turns out
  that handling bucket-related messages is similar to how next-hop messages
  are handled. These patches extract the commonalities into reusable
  components.

The plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next hop groups (this patchset)
3) Implementation of resilient next hop group
4) Netdevsim offload plus a suite of selftests
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the current state of the code at [2] and
[3].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
[3] https://github.com/idosch/iproute2/commits/submit/res_v1
====================

Link: https://lore.kernel.org/r/cover.1611836479.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parents 4915a404 0bccf8ed
@@ -4309,11 +4309,18 @@ static int mlxsw_sp_nexthop_obj_validate(struct mlxsw_sp *mlxsw_sp,
 	if (event != NEXTHOP_EVENT_REPLACE)
 		return 0;
 
-	if (!info->is_grp)
+	switch (info->type) {
+	case NH_NOTIFIER_INFO_TYPE_SINGLE:
 		return mlxsw_sp_nexthop_obj_single_validate(mlxsw_sp, info->nh,
 							    info->extack);
-	return mlxsw_sp_nexthop_obj_group_validate(mlxsw_sp, info->nh_grp,
-						   info->extack);
+	case NH_NOTIFIER_INFO_TYPE_GRP:
+		return mlxsw_sp_nexthop_obj_group_validate(mlxsw_sp,
+							   info->nh_grp,
+							   info->extack);
+	default:
+		NL_SET_ERR_MSG_MOD(info->extack, "Unsupported nexthop type");
+		return -EOPNOTSUPP;
+	}
 }
 
 static bool mlxsw_sp_nexthop_obj_is_gateway(struct mlxsw_sp *mlxsw_sp,
@@ -4321,13 +4328,17 @@ static bool mlxsw_sp_nexthop_obj_is_gateway(struct mlxsw_sp *mlxsw_sp,
 {
 	const struct net_device *dev;
 
-	if (info->is_grp)
+	switch (info->type) {
+	case NH_NOTIFIER_INFO_TYPE_SINGLE:
+		dev = info->nh->dev;
+		return info->nh->gw_family || info->nh->is_reject ||
+		       mlxsw_sp_netdev_ipip_type(mlxsw_sp, dev, NULL);
+	case NH_NOTIFIER_INFO_TYPE_GRP:
 		/* Already validated earlier. */
 		return true;
-
-	dev = info->nh->dev;
-	return info->nh->gw_family || info->nh->is_reject ||
-	       mlxsw_sp_netdev_ipip_type(mlxsw_sp, dev, NULL);
+	default:
+		return false;
+	}
 }
 
 static void mlxsw_sp_nexthop_obj_blackhole_init(struct mlxsw_sp *mlxsw_sp,
@@ -4410,11 +4421,22 @@ mlxsw_sp_nexthop_obj_group_info_init(struct mlxsw_sp *mlxsw_sp,
 				     struct mlxsw_sp_nexthop_group *nh_grp,
 				     struct nh_notifier_info *info)
 {
-	unsigned int nhs = info->is_grp ? info->nh_grp->num_nh : 1;
 	struct mlxsw_sp_nexthop_group_info *nhgi;
 	struct mlxsw_sp_nexthop *nh;
+	unsigned int nhs;
 	int err, i;
 
+	switch (info->type) {
+	case NH_NOTIFIER_INFO_TYPE_SINGLE:
+		nhs = 1;
+		break;
+	case NH_NOTIFIER_INFO_TYPE_GRP:
+		nhs = info->nh_grp->num_nh;
+		break;
+	default:
+		return -EINVAL;
+	}
+
 	nhgi = kzalloc(struct_size(nhgi, nexthops, nhs), GFP_KERNEL);
 	if (!nhgi)
 		return -ENOMEM;
@@ -4427,12 +4449,18 @@ mlxsw_sp_nexthop_obj_group_info_init(struct mlxsw_sp *mlxsw_sp,
 		int weight;
 
 		nh = &nhgi->nexthops[i];
-		if (info->is_grp) {
-			nh_obj = &info->nh_grp->nh_entries[i].nh;
-			weight = info->nh_grp->nh_entries[i].weight;
-		} else {
+		switch (info->type) {
+		case NH_NOTIFIER_INFO_TYPE_SINGLE:
 			nh_obj = info->nh;
 			weight = 1;
+			break;
+		case NH_NOTIFIER_INFO_TYPE_GRP:
+			nh_obj = &info->nh_grp->nh_entries[i].nh;
+			weight = info->nh_grp->nh_entries[i].weight;
+			break;
+		default:
+			err = -EINVAL;
+			goto err_nexthop_obj_init;
 		}
 		err = mlxsw_sp_nexthop_obj_init(mlxsw_sp, nh_grp, nh, nh_obj,
 						weight);
......
@@ -860,7 +860,7 @@ static struct nsim_nexthop *nsim_nexthop_create(struct nsim_fib_data *data,
 
 	nexthop = kzalloc(sizeof(*nexthop), GFP_KERNEL);
 	if (!nexthop)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	nexthop->id = info->id;
 
@@ -868,15 +868,20 @@ static struct nsim_nexthop *nsim_nexthop_create(struct nsim_fib_data *data,
 	 * occupy.
 	 */
 
-	if (!info->is_grp) {
+	switch (info->type) {
+	case NH_NOTIFIER_INFO_TYPE_SINGLE:
 		occ = 1;
-		goto out;
+		break;
+	case NH_NOTIFIER_INFO_TYPE_GRP:
+		for (i = 0; i < info->nh_grp->num_nh; i++)
+			occ += info->nh_grp->nh_entries[i].weight;
+		break;
+	default:
+		NL_SET_ERR_MSG_MOD(info->extack, "Unsupported nexthop type");
+		kfree(nexthop);
+		return ERR_PTR(-EOPNOTSUPP);
 	}
 
-	for (i = 0; i < info->nh_grp->num_nh; i++)
-		occ += info->nh_grp->nh_entries[i].weight;
-
-out:
 	nexthop->occ = occ;
 	return nexthop;
 }
@@ -972,8 +977,8 @@ static int nsim_nexthop_insert(struct nsim_fib_data *data,
 	int err;
 
 	nexthop = nsim_nexthop_create(data, info);
-	if (!nexthop)
-		return -ENOMEM;
+	if (IS_ERR(nexthop))
+		return PTR_ERR(nexthop);
 
 	nexthop_old = rhashtable_lookup_fast(&data->nexthop_ht, &info->id,
 					     nsim_nexthop_ht_params);
......
@@ -66,7 +66,12 @@ struct nh_info {
 struct nh_grp_entry {
 	struct nexthop *nh;
 	u8 weight;
-	atomic_t upper_bound;
+
+	union {
+		struct {
+			atomic_t upper_bound;
+		} mpath;
+	};
 
 	struct list_head nh_list;
 	struct nexthop *nh_parent; /* nexthop of group with this entry */
@@ -109,6 +114,11 @@ enum nexthop_event_type {
 	NEXTHOP_EVENT_REPLACE,
 };
 
+enum nh_notifier_info_type {
+	NH_NOTIFIER_INFO_TYPE_SINGLE,
+	NH_NOTIFIER_INFO_TYPE_GRP,
+};
+
 struct nh_notifier_single_info {
 	struct net_device *dev;
 	u8 gw_family;
@@ -137,7 +147,7 @@ struct nh_notifier_info {
 	struct net *net;
 	struct netlink_ext_ack *extack;
 	u32 id;
-	bool is_grp;
+	enum nh_notifier_info_type type;
 	union {
 		struct nh_notifier_single_info *nh;
 		struct nh_notifier_grp_info *nh_grp;
......
This diff is collapsed.