    net: nexthop: Increase weight to u16 · b72a6a7a
    Petr Machata authored
    In CLOS networks, as link failures occur at various points in the network,
    ECMP weights of the involved nodes are adjusted to compensate. With high
    fan-out of the involved nodes, and overall high number of nodes,
    a (non-)ECMP weight ratio that we would like to configure does not fit into
    8 bits. Instead of, say, 255:254, we might like to configure something like
    1000:999. For these deployments, the 8-bit weight may not be enough.
    
    To that end, in this patch increase the next hop weight from u8 to u16.
    
    Increasing the width of an integral type can be tricky, because while the
    code still compiles, the types may not check out anymore, and numerical
    errors come up. To prevent this, the conversion was done in two steps.
    First the type was changed from u8 to a single-member structure, which
    invalidated all uses of the field. This allowed going through them one by
    one and auditing each for type correctness. Then the structure was replaced
    with a vanilla u16 again. This should ensure that no place was missed.
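
    The two-step conversion can be illustrated with a minimal sketch. All
    type and function names below are hypothetical and chosen for this
    example only; they are not the names used in the kernel sources.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Step 1 (temporary): wrap the widened value in a single-member struct.
 * A struct does not convert implicitly to or from an integer, so every
 * existing use such as "u8 w = entry->weight;" stops compiling and has
 * to be visited by hand.
 */
struct nh_weight_wrapped {
	uint16_t v;
};

/* The wrapper is expected to add no padding on common ABIs. */
static_assert(sizeof(struct nh_weight_wrapped) == sizeof(uint16_t),
	      "wrapper should be the same size as the value it wraps");

struct grp_entry_step1 {
	uint32_t id;
	struct nh_weight_wrapped weight;
};

/*
 * An audited use site: the accumulator is wide enough for a sum of
 * 16-bit weights, and the access spells out ".v" explicitly.
 */
static uint32_t total_weight_step1(const struct grp_entry_step1 *e, int n)
{
	uint32_t total = 0;

	for (int i = 0; i < n; i++)
		total += e[i].weight.v;
	return total;
}

/*
 * Step 2 (final): once every use has been audited, the wrapper is
 * dropped and the field becomes a plain 16-bit integer again.
 */
struct grp_entry_step2 {
	uint32_t id;
	uint16_t weight;
};
```

    The intermediate struct never needs to be committed; it only serves to
    turn silent truncation into compile errors while the audit is ongoing.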
    
    The UAPI for configuring nexthop group members is that an attribute
    NHA_GROUP carries an array of struct nexthop_grp entries:
    
    	struct nexthop_grp {
    		__u32	id;	  /* nexthop id - must exist */
    		__u8	weight;   /* weight of this nexthop */
    		__u8	resvd1;
    		__u16	resvd2;
    	};
    
    The field resvd1 is currently validated and required to be zero. We can
    lift this requirement and carry high-order bits of the weight in the
    reserved field:
    
    	struct nexthop_grp {
    		__u32	id;	  /* nexthop id - must exist */
    		__u8	weight;   /* weight of this nexthop */
    		__u8	weight_high;
    		__u16	resvd2;
    	};
    
    Keeping the fields split this way was chosen in case an existing userspace
    makes assumptions about the width of the weight field, and to sidestep any
    endianness issues.
    
    The weight field is currently encoded as the weight value minus one,
    because a weight of 0 is invalid. This same trick is impossible for the new
    weight_high field, because zero must mean actual zero. With this in place:
    
    - Old userspace is guaranteed to carry weight_high of 0, therefore
      configuring 8-bit weights as appropriate. When dumping nexthops with
      16-bit weight, it would only show the lower 8 bits. But configuring such
      nexthops implies existence of userspace aware of the extension in the
      first place.
    
    - New userspace talking to an old kernel will work as long as it only
      attempts to configure 8-bit weights, where the high-order bits are zero.
      Old kernel will bounce attempts at configuring >8-bit weights.
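
    The encoding and the two compatibility cases above can be sketched as
    follows. This is an illustration under stated assumptions, not the
    helper added by the patch: the struct is a local mirror of the layout
    quoted earlier, all function names are hypothetical, and the old-kernel
    check models only the "resvd1 must be zero" rule described above.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the new layout; the "_sketch" suffix marks it as ours. */
struct nexthop_grp_sketch {
	uint32_t id;
	uint8_t  weight;      /* low 8 bits of (weight - 1) */
	uint8_t  weight_high; /* high 8 bits of (weight - 1) */
	uint16_t resvd2;
};

/* Reusing the reserved byte must not change the entry size. */
static_assert(sizeof(struct nexthop_grp_sketch) == 8,
	      "entry layout must stay 8 bytes");

/* Hypothetical encoder: store (weight - 1) split across the two bytes. */
static void grp_set_weight(struct nexthop_grp_sketch *e, uint16_t weight)
{
	assert(weight >= 1); /* a weight of 0 is invalid */

	uint16_t enc = (uint16_t)(weight - 1);

	e->weight = (uint8_t)(enc & 0xff);
	e->weight_high = (uint8_t)(enc >> 8);
}

/*
 * Hypothetical decoder: join the two bytes and add the one back.
 * Note that both bytes at 0xff would wrap to 0 in a 16-bit result; this
 * sketch never produces that encoding, but a real receiver has to
 * handle it.
 */
static uint16_t grp_get_weight(const struct nexthop_grp_sketch *e)
{
	return (uint16_t)((((unsigned int)e->weight_high << 8) | e->weight) + 1);
}

/* What userspace unaware of weight_high would display from a dump. */
static uint16_t grp_get_weight_legacy(const struct nexthop_grp_sketch *e)
{
	return (uint16_t)(e->weight + 1);
}

/*
 * Model of the pre-patch validation: the byte now called weight_high
 * was resvd1 and had to be zero. Returns nonzero if accepted.
 */
static int old_kernel_accepts(const struct nexthop_grp_sketch *e)
{
	return e->weight_high == 0;
}
```

    For example, a weight of 1000 is stored as 999 (0x03e7), so weight holds
    0xe7 and weight_high holds 0x03. Legacy userspace would display 232 for
    that entry, and the modeled old kernel would reject it. A weight of 256
    still fits the original 8-bit field and is accepted either way.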
    
    Renaming reserved fields as they are allocated for some purpose is commonly
    done in Linux. Whoever touches a reserved field is doing so at their own
    risk. nexthop_grp::resvd1 in particular is currently used by at least
    strace, however it carries its own copy of the UAPI headers, and the
    conversion should be trivial. A helper is provided for decoding the weight
    out of the two fields. Forcing a conversion seems preferable to bending
    over backwards and introducing anonymous unions or whatever.
    
    Signed-off-by: Petr Machata <petrm@nvidia.com>
    Reviewed-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
    Link: https://patch.msgid.link/483e2fcf4beb0d9135d62e7d27b46fa2685479d4.1723036486.git.petrm@nvidia.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>