Commit 24f627a3 authored by Jakub Kicinski's avatar Jakub Kicinski

Merge branch 'implement-devlink-rate-api-and-extend-it'

Michal Wilczynski says:

====================
Implement devlink-rate API and extend it

This patch series implements devlink-rate for ice driver. Unfortunately
current API isn't flexible enough for our use case, so there is a need to
extend it. Some functions have been introduced to enable the driver to
export current Tx scheduling configuration.

Pasting justification for this series from commit implementing devlink-rate
in ice driver(that is a part of this series):

There is a need to support modification of Tx scheduler tree, in the
ice driver. This will allow user to control Tx settings of each node in
the internal hierarchy of nodes. As a result user will be able to use
Hierarchy QoS implemented entirely in the hardware.

This patch implemenents devlink-rate API. It also exports initial
default hierarchy. It's mostly dictated by the fact that the tree
can't be removed entirely, all we can do is enable the user to modify
it. For example root node shouldn't ever be removed, also nodes that
have children are off-limits.

Example initial tree with 2 VF's:

[root@fedora ~]# devlink port function rate show
pci/0000:4b:00.0/node_27: type node parent node_26
pci/0000:4b:00.0/node_26: type node parent node_0
pci/0000:4b:00.0/node_34: type node parent node_33
pci/0000:4b:00.0/node_33: type node parent node_32
pci/0000:4b:00.0/node_32: type node parent node_16
pci/0000:4b:00.0/node_19: type node parent node_18
pci/0000:4b:00.0/node_18: type node parent node_17
pci/0000:4b:00.0/node_17: type node parent node_16
pci/0000:4b:00.0/node_21: type node parent node_20
pci/0000:4b:00.0/node_20: type node parent node_3
pci/0000:4b:00.0/node_14: type node parent node_5
pci/0000:4b:00.0/node_5: type node parent node_3
pci/0000:4b:00.0/node_13: type node parent node_4
pci/0000:4b:00.0/node_12: type node parent node_4
pci/0000:4b:00.0/node_11: type node parent node_4
pci/0000:4b:00.0/node_10: type node parent node_4
pci/0000:4b:00.0/node_9: type node parent node_4
pci/0000:4b:00.0/node_8: type node parent node_4
pci/0000:4b:00.0/node_7: type node parent node_4
pci/0000:4b:00.0/node_6: type node parent node_4
pci/0000:4b:00.0/node_4: type node parent node_3
pci/0000:4b:00.0/node_3: type node parent node_16
pci/0000:4b:00.0/node_16: type node parent node_15
pci/0000:4b:00.0/node_15: type node parent node_0
pci/0000:4b:00.0/node_2: type node parent node_1
pci/0000:4b:00.0/node_1: type node parent node_0
pci/0000:4b:00.0/node_0: type node
pci/0000:4b:00.0/1: type leaf parent node_27
pci/0000:4b:00.0/2: type leaf parent node_27

Let me visualize part of the tree:

                        +---------+
                        |  node_0 |
                        +---------+
                             |
                        +----v----+
                        | node_26 |
                        +----+----+
                             |
                        +----v----+
                        | node_27 |
                        +----+----+
                             |
                    |-----------------|
               +----v----+       +----v----+
               |   VF 1  |       |   VF 2  |
               +----+----+       +----+----+

So at this point there is a couple things that can be done.
For example we could only assign parameters to VF's.

[root@fedora ~]# devlink port function rate set pci/0000:4b:00.0/1 \
                 tx_max 5Gbps

This would cap the VF 1 BW to 5Gbps.

But let's say you would like to create a completely new branch.
This can be done like this:

[root@fedora ~]# devlink port function rate add \
                 pci/0000:4b:00.0/node_custom parent node_0
[root@fedora ~]# devlink port function rate add \
                 pci/0000:4b:00.0/node_custom_1 parent node_custom
[root@fedora ~]# devlink port function rate set \
                 pci/0000:4b:00.0/1 parent node_custom_1

This creates a completely new branch and reassigns VF 1 to it.

A number of parameters is supported per each node: tx_max, tx_share,
tx_priority and tx_weight.
====================

Link: https://lore.kernel.org/r/20221115104825.172668-1-michal.wilczynski@intel.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
parents 4ab45e97 242dd643
......@@ -191,13 +191,44 @@ API allows to configure following rate object's parameters:
``tx_max``
Maximum TX rate value.
``tx_priority``
Allows for usage of strict priority arbiter among siblings. This
arbitration scheme attempts to schedule nodes based on their priority
as long as the nodes remain within their bandwidth limit. The higher the
priority the higher the probability that the node will get selected for
scheduling.
``tx_weight``
Allows for usage of Weighted Fair Queuing arbitration scheme among
siblings. This arbitration scheme can be used simultaneously with the
strict priority. As a node is configured with a higher rate it gets more
BW relative to it's siblings. Values are relative like a percentage
points, they basically tell how much BW should node take relative to
it's siblings.
``parent``
Parent node name. Parent node rate limits are considered as additional limits
to all node children limits. ``tx_max`` is an upper limit for children.
``tx_share`` is a total bandwidth distributed among children.
``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.
Arbitration flow from the high level:
#. Choose a node, or group of nodes with the highest priority that stays
within the BW limit and are not blocked. Use ``tx_priority`` as a
parameter for this arbitration.
#. If group of nodes have the same priority perform WFQ arbitration on
that subgroup. Use ``tx_weight`` as a parameter for this arbitration.
#. Select the winner node, and continue arbitration flow among it's children,
until leaf node is reached, and the winner is established.
#. If all the nodes from the highest priority sub-group are satisfied, or
overused their assigned BW, move to the lower priority nodes.
Driver implementations are allowed to support both or either rate object types
and setting methods of their parameters.
and setting methods of their parameters. Additionally driver implementation
may export nodes/leafs and their child-parent relationships.
Terms and Definitions
=====================
......
......@@ -254,3 +254,118 @@ Users can request an immediate capture of a snapshot via the
0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
$ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
Devlink Rate
============
The ``ice`` driver implements devlink-rate API. It allows for offload of
the Hierarchical QoS to the hardware. It enables user to group Virtual
Functions in a tree structure and assign supported parameters: tx_share,
tx_max, tx_priority and tx_weight to each node in a tree. So effectively
user gains an ability to control how much bandwidth is allocated for each
VF group. This is later enforced by the HW.
It is assumed that this feature is mutually exclusive with DCB performed
in FW and ADQ, or any driver feature that would trigger changes in QoS,
for example creation of the new traffic class. The driver will prevent DCB
or ADQ configuration if user started making any changes to the nodes using
devlink-rate API. To configure those features a driver reload is necessary.
Correspondingly if ADQ or DCB will get configured the driver won't export
hierarchy at all, or will remove the untouched hierarchy if those
features are enabled after the hierarchy is exported, but before any
changes are made.
This feature is also dependent on switchdev being enabled in the system.
It's required bacause devlink-rate requires devlink-port objects to be
present, and those objects are only created in switchdev mode.
If the driver is set to the switchdev mode, it will export internal
hierarchy the moment VF's are created. Root of the tree is always
represented by the node_0. This node can't be deleted by the user. Leaf
nodes and nodes with children also can't be deleted.
.. list-table:: Attributes supported
:widths: 15 85
* - Name
- Description
* - ``tx_max``
- maximum bandwidth to be consumed by the tree Node. Rate Limit is
an absolute number specifying a maximum amount of bytes a Node may
consume during the course of one second. Rate limit guarantees
that a link will not oversaturate the receiver on the remote end
and also enforces an SLA between the subscriber and network
provider.
* - ``tx_share``
- minimum bandwidth allocated to a tree node when it is not blocked.
It specifies an absolute BW. While tx_max defines the maximum
bandwidth the node may consume, the tx_share marks committed BW
for the Node.
* - ``tx_priority``
- allows for usage of strict priority arbiter among siblings. This
arbitration scheme attempts to schedule nodes based on their
priority as long as the nodes remain within their bandwidth limit.
Range 0-7. Nodes with priority 7 have the highest priority and are
selected first, while nodes with priority 0 have the lowest
priority. Nodes that have the same priority are treated equally.
* - ``tx_weight``
- allows for usage of Weighted Fair Queuing arbitration scheme among
siblings. This arbitration scheme can be used simultaneously with
the strict priority. Range 1-200. Only relative values mater for
arbitration.
``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.
.. code:: shell
# enable switchdev
$ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev
# at this point driver should export internal hierarchy
$ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs
$ devlink port function rate show
pci/0000:4b:00.0/node_25: type node parent node_24
pci/0000:4b:00.0/node_24: type node parent node_0
pci/0000:4b:00.0/node_32: type node parent node_31
pci/0000:4b:00.0/node_31: type node parent node_30
pci/0000:4b:00.0/node_30: type node parent node_16
pci/0000:4b:00.0/node_19: type node parent node_18
pci/0000:4b:00.0/node_18: type node parent node_17
pci/0000:4b:00.0/node_17: type node parent node_16
pci/0000:4b:00.0/node_14: type node parent node_5
pci/0000:4b:00.0/node_5: type node parent node_3
pci/0000:4b:00.0/node_13: type node parent node_4
pci/0000:4b:00.0/node_12: type node parent node_4
pci/0000:4b:00.0/node_11: type node parent node_4
pci/0000:4b:00.0/node_10: type node parent node_4
pci/0000:4b:00.0/node_9: type node parent node_4
pci/0000:4b:00.0/node_8: type node parent node_4
pci/0000:4b:00.0/node_7: type node parent node_4
pci/0000:4b:00.0/node_6: type node parent node_4
pci/0000:4b:00.0/node_4: type node parent node_3
pci/0000:4b:00.0/node_3: type node parent node_16
pci/0000:4b:00.0/node_16: type node parent node_15
pci/0000:4b:00.0/node_15: type node parent node_0
pci/0000:4b:00.0/node_2: type node parent node_1
pci/0000:4b:00.0/node_1: type node parent node_0
pci/0000:4b:00.0/node_0: type node
pci/0000:4b:00.0/1: type leaf parent node_25
pci/0000:4b:00.0/2: type leaf parent node_25
# let's create some custom node
$ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
# second custom node
$ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom
# reassign second VF to newly created branch
$ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1
# assign tx_weight to the VF
$ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5
# assign tx_share to the VF
$ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
......@@ -848,9 +848,9 @@ struct ice_aqc_txsched_elem {
u8 generic;
#define ICE_AQC_ELEM_GENERIC_MODE_M 0x1
#define ICE_AQC_ELEM_GENERIC_PRIO_S 0x1
#define ICE_AQC_ELEM_GENERIC_PRIO_M (0x7 << ICE_AQC_ELEM_GENERIC_PRIO_S)
#define ICE_AQC_ELEM_GENERIC_PRIO_M GENMASK(3, 1)
#define ICE_AQC_ELEM_GENERIC_SP_S 0x4
#define ICE_AQC_ELEM_GENERIC_SP_M (0x1 << ICE_AQC_ELEM_GENERIC_SP_S)
#define ICE_AQC_ELEM_GENERIC_SP_M GENMASK(4, 4)
#define ICE_AQC_ELEM_GENERIC_ADJUST_VAL_S 0x5
#define ICE_AQC_ELEM_GENERIC_ADJUST_VAL_M \
(0x3 << ICE_AQC_ELEM_GENERIC_ADJUST_VAL_S)
......
......@@ -1105,6 +1105,9 @@ int ice_init_hw(struct ice_hw *hw)
hw->evb_veb = true;
/* init xarray for identifying scheduling nodes uniquely */
xa_init_flags(&hw->port_info->sched_node_ids, XA_FLAGS_ALLOC);
/* Query the allocated resources for Tx scheduler */
status = ice_sched_query_res_alloc(hw);
if (status) {
......@@ -4600,7 +4603,7 @@ ice_ena_vsi_txq(struct ice_port_info *pi, u16 vsi_handle, u8 tc, u16 q_handle,
q_ctx->q_teid = le32_to_cpu(node.node_teid);
/* add a leaf node into scheduler tree queue layer */
status = ice_sched_add_node(pi, hw->num_tx_sched_layers - 1, &node);
status = ice_sched_add_node(pi, hw->num_tx_sched_layers - 1, &node, NULL);
if (!status)
status = ice_sched_replay_q_bw(pi, q_ctx);
......@@ -4835,7 +4838,7 @@ ice_ena_vsi_rdma_qset(struct ice_port_info *pi, u16 vsi_handle, u8 tc,
for (i = 0; i < num_qsets; i++) {
node.node_teid = buf->rdma_qsets[i].qset_teid;
ret = ice_sched_add_node(pi, hw->num_tx_sched_layers - 1,
&node);
&node, NULL);
if (ret)
break;
qset_teid[i] = le32_to_cpu(node.node_teid);
......
......@@ -1580,7 +1580,7 @@ ice_update_port_tc_tree_cfg(struct ice_port_info *pi,
/* new TC */
status = ice_sched_query_elem(pi->hw, teid2, &elem);
if (!status)
status = ice_sched_add_node(pi, 1, &elem);
status = ice_sched_add_node(pi, 1, &elem, NULL);
if (status)
break;
/* update the TC number */
......
......@@ -3,6 +3,7 @@
#include "ice_dcb_lib.h"
#include "ice_dcb_nl.h"
#include "ice_devlink.h"
/**
* ice_dcb_get_ena_tc - return bitmap of enabled TCs
......@@ -364,6 +365,12 @@ int ice_pf_dcb_cfg(struct ice_pf *pf, struct ice_dcbx_cfg *new_cfg, bool locked)
/* Enable DCB tagging only when more than one TC */
if (ice_dcb_get_num_tc(new_cfg) > 1) {
dev_dbg(dev, "DCB tagging enabled (num TC > 1)\n");
if (pf->hw.port_info->is_custom_tx_enabled) {
dev_err(dev, "Custom Tx scheduler feature enabled, can't configure DCB\n");
return -EBUSY;
}
ice_tear_down_devlink_rate_tree(pf);
set_bit(ICE_FLAG_DCB_ENA, pf->flags);
} else {
dev_dbg(dev, "DCB tagging disabled (num TC = 1)\n");
......
......@@ -18,4 +18,7 @@ void ice_devlink_destroy_vf_port(struct ice_vf *vf);
void ice_devlink_init_regions(struct ice_pf *pf);
void ice_devlink_destroy_regions(struct ice_pf *pf);
int ice_devlink_rate_init_tx_topology(struct devlink *devlink, struct ice_vsi *vsi);
void ice_tear_down_devlink_rate_tree(struct ice_pf *pf);
#endif /* _ICE_DEVLINK_H_ */
......@@ -8580,6 +8580,12 @@ static int ice_setup_tc_mqprio_qdisc(struct net_device *netdev, void *type_data)
switch (mode) {
case TC_MQPRIO_MODE_CHANNEL:
if (pf->hw.port_info->is_custom_tx_enabled) {
dev_err(dev, "Custom Tx scheduler feature enabled, can't configure ADQ\n");
return -EBUSY;
}
ice_tear_down_devlink_rate_tree(pf);
ret = ice_validate_mqprio_qopt(vsi, mqprio_qopt);
if (ret) {
netdev_err(netdev, "failed to validate_mqprio_qopt(), ret %d\n",
......
......@@ -6,6 +6,7 @@
#include "ice_devlink.h"
#include "ice_sriov.h"
#include "ice_tc_lib.h"
#include "ice_dcb_lib.h"
/**
* ice_repr_get_sw_port_id - get port ID associated with representor
......@@ -389,6 +390,7 @@ static void ice_repr_rem(struct ice_vf *vf)
*/
void ice_repr_rem_from_all_vfs(struct ice_pf *pf)
{
struct devlink *devlink;
struct ice_vf *vf;
unsigned int bkt;
......@@ -396,6 +398,14 @@ void ice_repr_rem_from_all_vfs(struct ice_pf *pf)
ice_for_each_vf(pf, bkt, vf)
ice_repr_rem(vf);
/* since all port representors are destroyed, there is
* no point in keeping the nodes
*/
devlink = priv_to_devlink(pf);
devl_lock(devlink);
devl_rate_nodes_destroy(devlink);
devl_unlock(devlink);
}
/**
......@@ -404,6 +414,7 @@ void ice_repr_rem_from_all_vfs(struct ice_pf *pf)
*/
int ice_repr_add_for_all_vfs(struct ice_pf *pf)
{
struct devlink *devlink;
struct ice_vf *vf;
unsigned int bkt;
int err;
......@@ -416,6 +427,13 @@ int ice_repr_add_for_all_vfs(struct ice_pf *pf)
goto err;
}
/* only export if ADQ and DCB disabled */
if (ice_is_adq_active(pf) || ice_is_dcb_active(pf))
return 0;
devlink = priv_to_devlink(pf);
ice_devlink_rate_init_tx_topology(devlink, ice_get_main_vsi(pf));
return 0;
err:
......
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2018, Intel Corporation. */
#include <net/devlink.h>
#include "ice_sched.h"
/**
......@@ -142,12 +143,14 @@ ice_aq_query_sched_elems(struct ice_hw *hw, u16 elems_req,
* @pi: port information structure
* @layer: Scheduler layer of the node
* @info: Scheduler element information from firmware
* @prealloc_node: preallocated ice_sched_node struct for SW DB
*
* This function inserts a scheduler node to the SW DB.
*/
int
ice_sched_add_node(struct ice_port_info *pi, u8 layer,
struct ice_aqc_txsched_elem_data *info)
struct ice_aqc_txsched_elem_data *info,
struct ice_sched_node *prealloc_node)
{
struct ice_aqc_txsched_elem_data elem;
struct ice_sched_node *parent;
......@@ -176,6 +179,9 @@ ice_sched_add_node(struct ice_port_info *pi, u8 layer,
if (status)
return status;
if (prealloc_node)
node = prealloc_node;
else
node = devm_kzalloc(ice_hw_to_dev(hw), sizeof(*node), GFP_KERNEL);
if (!node)
return -ENOMEM;
......@@ -355,6 +361,9 @@ void ice_free_sched_node(struct ice_port_info *pi, struct ice_sched_node *node)
/* leaf nodes have no children */
if (node->children)
devm_kfree(ice_hw_to_dev(hw), node->children);
kfree(node->name);
xa_erase(&pi->sched_node_ids, node->id);
devm_kfree(ice_hw_to_dev(hw), node);
}
......@@ -872,13 +881,15 @@ void ice_sched_cleanup_all(struct ice_hw *hw)
* @num_nodes: number of nodes
* @num_nodes_added: pointer to num nodes added
* @first_node_teid: if new nodes are added then return the TEID of first node
* @prealloc_nodes: preallocated nodes struct for software DB
*
* This function add nodes to HW as well as to SW DB for a given layer
*/
static int
int
ice_sched_add_elems(struct ice_port_info *pi, struct ice_sched_node *tc_node,
struct ice_sched_node *parent, u8 layer, u16 num_nodes,
u16 *num_nodes_added, u32 *first_node_teid)
u16 *num_nodes_added, u32 *first_node_teid,
struct ice_sched_node **prealloc_nodes)
{
struct ice_sched_node *prev, *new_node;
struct ice_aqc_add_elem *buf;
......@@ -924,7 +935,11 @@ ice_sched_add_elems(struct ice_port_info *pi, struct ice_sched_node *tc_node,
*num_nodes_added = num_nodes;
/* add nodes to the SW DB */
for (i = 0; i < num_nodes; i++) {
status = ice_sched_add_node(pi, layer, &buf->generic[i]);
if (prealloc_nodes)
status = ice_sched_add_node(pi, layer, &buf->generic[i], prealloc_nodes[i]);
else
status = ice_sched_add_node(pi, layer, &buf->generic[i], NULL);
if (status) {
ice_debug(hw, ICE_DBG_SCHED, "add nodes in SW DB failed status =%d\n",
status);
......@@ -940,6 +955,22 @@ ice_sched_add_elems(struct ice_port_info *pi, struct ice_sched_node *tc_node,
new_node->sibling = NULL;
new_node->tc_num = tc_node->tc_num;
new_node->tx_weight = ICE_SCHED_DFLT_BW_WT;
new_node->tx_share = ICE_SCHED_DFLT_BW;
new_node->tx_max = ICE_SCHED_DFLT_BW;
new_node->name = kzalloc(SCHED_NODE_NAME_MAX_LEN, GFP_KERNEL);
if (!new_node->name)
return -ENOMEM;
status = xa_alloc(&pi->sched_node_ids, &new_node->id, NULL, XA_LIMIT(0, UINT_MAX),
GFP_KERNEL);
if (status) {
ice_debug(hw, ICE_DBG_SCHED, "xa_alloc failed for sched node status =%d\n",
status);
break;
}
snprintf(new_node->name, SCHED_NODE_NAME_MAX_LEN, "node_%u", new_node->id);
/* add it to previous node sibling pointer */
/* Note: siblings are not linked across branches */
......@@ -1003,7 +1034,7 @@ ice_sched_add_nodes_to_hw_layer(struct ice_port_info *pi,
}
return ice_sched_add_elems(pi, tc_node, parent, layer, num_nodes,
num_nodes_added, first_node_teid);
num_nodes_added, first_node_teid, NULL);
}
/**
......@@ -1268,7 +1299,7 @@ int ice_sched_init_port(struct ice_port_info *pi)
ICE_AQC_ELEM_TYPE_ENTRY_POINT)
hw->sw_entry_point_layer = j;
status = ice_sched_add_node(pi, j, &buf[i].generic[j]);
status = ice_sched_add_node(pi, j, &buf[i].generic[j], NULL);
if (status)
goto err_init_port;
}
......@@ -2154,7 +2185,7 @@ ice_sched_get_free_vsi_parent(struct ice_hw *hw, struct ice_sched_node *node,
* This function removes the child from the old parent and adds it to a new
* parent
*/
static void
void
ice_sched_update_parent(struct ice_sched_node *new_parent,
struct ice_sched_node *node)
{
......@@ -2188,7 +2219,7 @@ ice_sched_update_parent(struct ice_sched_node *new_parent,
*
* This function move the child nodes to a given parent.
*/
static int
int
ice_sched_move_nodes(struct ice_port_info *pi, struct ice_sched_node *parent,
u16 num_items, u32 *list)
{
......@@ -3560,7 +3591,7 @@ ice_sched_set_eir_srl_excl(struct ice_port_info *pi,
* node's RL profile ID of type CIR, EIR, or SRL, and removes old profile
* ID from local database. The caller needs to hold scheduler lock.
*/
static int
int
ice_sched_set_node_bw(struct ice_port_info *pi, struct ice_sched_node *node,
enum ice_rl_type rl_type, u32 bw, u8 layer_num)
{
......@@ -3596,6 +3627,57 @@ ice_sched_set_node_bw(struct ice_port_info *pi, struct ice_sched_node *node,
ICE_AQC_RL_PROFILE_TYPE_M, old_id);
}
/**
* ice_sched_set_node_priority - set node's priority
* @pi: port information structure
* @node: tree node
* @priority: number 0-7 representing priority among siblings
*
* This function sets priority of a node among it's siblings.
*/
int
ice_sched_set_node_priority(struct ice_port_info *pi, struct ice_sched_node *node,
u16 priority)
{
struct ice_aqc_txsched_elem_data buf;
struct ice_aqc_txsched_elem *data;
buf = node->info;
data = &buf.data;
data->valid_sections |= ICE_AQC_ELEM_VALID_GENERIC;
data->generic |= FIELD_PREP(ICE_AQC_ELEM_GENERIC_PRIO_M, priority);
return ice_sched_update_elem(pi->hw, node, &buf);
}
/**
* ice_sched_set_node_weight - set node's weight
* @pi: port information structure
* @node: tree node
* @weight: number 1-200 representing weight for WFQ
*
* This function sets weight of the node for WFQ algorithm.
*/
int
ice_sched_set_node_weight(struct ice_port_info *pi, struct ice_sched_node *node, u16 weight)
{
struct ice_aqc_txsched_elem_data buf;
struct ice_aqc_txsched_elem *data;
buf = node->info;
data = &buf.data;
data->valid_sections = ICE_AQC_ELEM_VALID_CIR | ICE_AQC_ELEM_VALID_EIR |
ICE_AQC_ELEM_VALID_GENERIC;
data->cir_bw.bw_alloc = cpu_to_le16(weight);
data->eir_bw.bw_alloc = cpu_to_le16(weight);
data->generic |= FIELD_PREP(ICE_AQC_ELEM_GENERIC_SP_M, 0x0);
return ice_sched_update_elem(pi->hw, node, &buf);
}
/**
* ice_sched_set_node_bw_lmt - set node's BW limit
* @pi: port information structure
......@@ -3606,7 +3688,7 @@ ice_sched_set_node_bw(struct ice_port_info *pi, struct ice_sched_node *node,
* It updates node's BW limit parameters like BW RL profile ID of type CIR,
* EIR, or SRL. The caller needs to hold scheduler lock.
*/
static int
int
ice_sched_set_node_bw_lmt(struct ice_port_info *pi, struct ice_sched_node *node,
enum ice_rl_type rl_type, u32 bw)
{
......
......@@ -6,6 +6,8 @@
#include "ice_common.h"
#define SCHED_NODE_NAME_MAX_LEN 32
#define ICE_QGRP_LAYER_OFFSET 2
#define ICE_VSI_LAYER_OFFSET 4
#define ICE_AGG_LAYER_OFFSET 6
......@@ -69,6 +71,29 @@ int
ice_aq_query_sched_elems(struct ice_hw *hw, u16 elems_req,
struct ice_aqc_txsched_elem_data *buf, u16 buf_size,
u16 *elems_ret, struct ice_sq_cd *cd);
int
ice_sched_set_node_bw_lmt(struct ice_port_info *pi, struct ice_sched_node *node,
enum ice_rl_type rl_type, u32 bw);
int
ice_sched_set_node_bw(struct ice_port_info *pi, struct ice_sched_node *node,
enum ice_rl_type rl_type, u32 bw, u8 layer_num);
int
ice_sched_add_elems(struct ice_port_info *pi, struct ice_sched_node *tc_node,
struct ice_sched_node *parent, u8 layer, u16 num_nodes,
u16 *num_nodes_added, u32 *first_node_teid,
struct ice_sched_node **prealloc_node);
int
ice_sched_move_nodes(struct ice_port_info *pi, struct ice_sched_node *parent,
u16 num_items, u32 *list);
int ice_sched_set_node_priority(struct ice_port_info *pi, struct ice_sched_node *node,
u16 priority);
int ice_sched_set_node_weight(struct ice_port_info *pi, struct ice_sched_node *node, u16 weight);
int ice_sched_init_port(struct ice_port_info *pi);
int ice_sched_query_res_alloc(struct ice_hw *hw);
void ice_sched_get_psm_clk_freq(struct ice_hw *hw);
......@@ -81,7 +106,11 @@ struct ice_sched_node *
ice_sched_find_node_by_teid(struct ice_sched_node *start_node, u32 teid);
int
ice_sched_add_node(struct ice_port_info *pi, u8 layer,
struct ice_aqc_txsched_elem_data *info);
struct ice_aqc_txsched_elem_data *info,
struct ice_sched_node *prealloc_node);
void
ice_sched_update_parent(struct ice_sched_node *new_parent,
struct ice_sched_node *node);
void ice_free_sched_node(struct ice_port_info *pi, struct ice_sched_node *node);
struct ice_sched_node *ice_sched_get_tc_node(struct ice_port_info *pi, u8 tc);
struct ice_sched_node *
......
......@@ -524,7 +524,14 @@ struct ice_sched_node {
struct ice_sched_node *sibling; /* next sibling in the same layer */
struct ice_sched_node **children;
struct ice_aqc_txsched_elem_data info;
char *name;
struct devlink_rate *rate_node;
u64 tx_max;
u64 tx_share;
u32 agg_id; /* aggregator group ID */
u32 id;
u32 tx_priority;
u32 tx_weight;
u16 vsi_handle;
u8 in_use; /* suspended or in use */
u8 tx_sched_layer; /* Logical Layer (1-9) */
......@@ -706,7 +713,9 @@ struct ice_port_info {
/* List contain profile ID(s) and other params per layer */
struct list_head rl_prof_list[ICE_AQC_TOPO_MAX_LEVEL_NUM];
struct ice_qos_cfg qos_cfg;
struct xarray sched_node_ids;
u8 is_vf:1;
u8 is_custom_tx_enabled:1;
};
struct ice_switch_info {
......
......@@ -91,7 +91,7 @@ int mlx5_esw_offloads_devlink_port_register(struct mlx5_eswitch *esw, u16 vport_
if (err)
goto reg_err;
err = devl_rate_leaf_create(dl_port, vport);
err = devl_rate_leaf_create(dl_port, vport, NULL);
if (err)
goto rate_err;
......@@ -160,7 +160,7 @@ int mlx5_esw_devlink_sf_port_register(struct mlx5_eswitch *esw, struct devlink_p
if (err)
return err;
err = devl_rate_leaf_create(dl_port, vport);
err = devl_rate_leaf_create(dl_port, vport, NULL);
if (err)
goto rate_err;
......
......@@ -1401,7 +1401,7 @@ static int __nsim_dev_port_add(struct nsim_dev *nsim_dev, enum nsim_dev_port_typ
if (nsim_dev_port_is_vf(nsim_dev_port)) {
err = devl_rate_leaf_create(&nsim_dev_port->devlink_port,
nsim_dev_port);
nsim_dev_port, NULL);
if (err)
goto err_nsim_destroy;
}
......
......@@ -114,6 +114,9 @@ struct devlink_rate {
refcount_t refcnt;
};
};
u32 tx_priority;
u32 tx_weight;
};
struct devlink_port {
......@@ -1511,10 +1514,18 @@ struct devlink_ops {
u64 tx_share, struct netlink_ext_ack *extack);
int (*rate_leaf_tx_max_set)(struct devlink_rate *devlink_rate, void *priv,
u64 tx_max, struct netlink_ext_ack *extack);
int (*rate_leaf_tx_priority_set)(struct devlink_rate *devlink_rate, void *priv,
u32 tx_priority, struct netlink_ext_ack *extack);
int (*rate_leaf_tx_weight_set)(struct devlink_rate *devlink_rate, void *priv,
u32 tx_weight, struct netlink_ext_ack *extack);
int (*rate_node_tx_share_set)(struct devlink_rate *devlink_rate, void *priv,
u64 tx_share, struct netlink_ext_ack *extack);
int (*rate_node_tx_max_set)(struct devlink_rate *devlink_rate, void *priv,
u64 tx_max, struct netlink_ext_ack *extack);
int (*rate_node_tx_priority_set)(struct devlink_rate *devlink_rate, void *priv,
u32 tx_priority, struct netlink_ext_ack *extack);
int (*rate_node_tx_weight_set)(struct devlink_rate *devlink_rate, void *priv,
u32 tx_weight, struct netlink_ext_ack *extack);
int (*rate_node_new)(struct devlink_rate *rate_node, void **priv,
struct netlink_ext_ack *extack);
int (*rate_node_del)(struct devlink_rate *rate_node, void *priv,
......@@ -1606,7 +1617,12 @@ void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port, u32 contro
void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port,
u32 controller, u16 pf, u32 sf,
bool external);
int devl_rate_leaf_create(struct devlink_port *port, void *priv);
struct devlink_rate *
devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
struct devlink_rate *parent);
int
devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv,
struct devlink_rate *parent);
void devl_rate_leaf_destroy(struct devlink_port *devlink_port);
void devl_rate_nodes_destroy(struct devlink *devlink);
void devlink_port_linecard_set(struct devlink_port *devlink_port,
......
......@@ -607,6 +607,9 @@ enum devlink_attr {
DEVLINK_ATTR_SELFTESTS, /* nested */
DEVLINK_ATTR_RATE_TX_PRIORITY, /* u32 */
DEVLINK_ATTR_RATE_TX_WEIGHT, /* u32 */
/* add new attributes above here, update the policy in devlink.c */
__DEVLINK_ATTR_MAX,
......
......@@ -1203,6 +1203,14 @@ static int devlink_nl_rate_fill(struct sk_buff *msg,
devlink_rate->tx_max, DEVLINK_ATTR_PAD))
goto nla_put_failure;
if (nla_put_u32(msg, DEVLINK_ATTR_RATE_TX_PRIORITY,
devlink_rate->tx_priority))
goto nla_put_failure;
if (nla_put_u32(msg, DEVLINK_ATTR_RATE_TX_WEIGHT,
devlink_rate->tx_weight))
goto nla_put_failure;
if (devlink_rate->parent)
if (nla_put_string(msg, DEVLINK_ATTR_RATE_PARENT_NODE_NAME,
devlink_rate->parent->name))
......@@ -1879,10 +1887,8 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
int err = -EOPNOTSUPP;
parent = devlink_rate->parent;
if (parent && len) {
NL_SET_ERR_MSG_MOD(info->extack, "Rate object already has parent.");
return -EBUSY;
} else if (parent && !len) {
if (parent && !len) {
if (devlink_rate_is_leaf(devlink_rate))
err = ops->rate_leaf_parent_set(devlink_rate, NULL,
devlink_rate->priv, NULL,
......@@ -1896,7 +1902,7 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
refcount_dec(&parent->refcnt);
devlink_rate->parent = NULL;
} else if (!parent && len) {
} else if (len) {
parent = devlink_rate_node_get_by_name(devlink, parent_name);
if (IS_ERR(parent))
return -ENODEV;
......@@ -1923,6 +1929,10 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
if (err)
return err;
if (devlink_rate->parent)
/* we're reassigning to other parent in this case */
refcount_dec(&devlink_rate->parent->refcnt);
refcount_inc(&parent->refcnt);
devlink_rate->parent = parent;
}
......@@ -1936,6 +1946,8 @@ static int devlink_nl_rate_set(struct devlink_rate *devlink_rate,
{
struct nlattr *nla_parent, **attrs = info->attrs;
int err = -EOPNOTSUPP;
u32 priority;
u32 weight;
u64 rate;
if (attrs[DEVLINK_ATTR_RATE_TX_SHARE]) {
......@@ -1964,6 +1976,34 @@ static int devlink_nl_rate_set(struct devlink_rate *devlink_rate,
devlink_rate->tx_max = rate;
}
if (attrs[DEVLINK_ATTR_RATE_TX_PRIORITY]) {
priority = nla_get_u32(attrs[DEVLINK_ATTR_RATE_TX_PRIORITY]);
if (devlink_rate_is_leaf(devlink_rate))
err = ops->rate_leaf_tx_priority_set(devlink_rate, devlink_rate->priv,
priority, info->extack);
else if (devlink_rate_is_node(devlink_rate))
err = ops->rate_node_tx_priority_set(devlink_rate, devlink_rate->priv,
priority, info->extack);
if (err)
return err;
devlink_rate->tx_priority = priority;
}
if (attrs[DEVLINK_ATTR_RATE_TX_WEIGHT]) {
weight = nla_get_u32(attrs[DEVLINK_ATTR_RATE_TX_WEIGHT]);
if (devlink_rate_is_leaf(devlink_rate))
err = ops->rate_leaf_tx_weight_set(devlink_rate, devlink_rate->priv,
weight, info->extack);
else if (devlink_rate_is_node(devlink_rate))
err = ops->rate_node_tx_weight_set(devlink_rate, devlink_rate->priv,
weight, info->extack);
if (err)
return err;
devlink_rate->tx_weight = weight;
}
nla_parent = attrs[DEVLINK_ATTR_RATE_PARENT_NODE_NAME];
if (nla_parent) {
err = devlink_nl_rate_parent_node_set(devlink_rate, info,
......@@ -1995,6 +2035,18 @@ static bool devlink_rate_set_ops_supported(const struct devlink_ops *ops,
NL_SET_ERR_MSG_MOD(info->extack, "Parent set isn't supported for the leafs");
return false;
}
if (attrs[DEVLINK_ATTR_RATE_TX_PRIORITY] && !ops->rate_leaf_tx_priority_set) {
NL_SET_ERR_MSG_ATTR(info->extack,
attrs[DEVLINK_ATTR_RATE_TX_PRIORITY],
"TX priority set isn't supported for the leafs");
return false;
}
if (attrs[DEVLINK_ATTR_RATE_TX_WEIGHT] && !ops->rate_leaf_tx_weight_set) {
NL_SET_ERR_MSG_ATTR(info->extack,
attrs[DEVLINK_ATTR_RATE_TX_WEIGHT],
"TX weight set isn't supported for the leafs");
return false;
}
} else if (type == DEVLINK_RATE_TYPE_NODE) {
if (attrs[DEVLINK_ATTR_RATE_TX_SHARE] && !ops->rate_node_tx_share_set) {
NL_SET_ERR_MSG_MOD(info->extack, "TX share set isn't supported for the nodes");
......@@ -2009,6 +2061,18 @@ static bool devlink_rate_set_ops_supported(const struct devlink_ops *ops,
NL_SET_ERR_MSG_MOD(info->extack, "Parent set isn't supported for the nodes");
return false;
}
if (attrs[DEVLINK_ATTR_RATE_TX_PRIORITY] && !ops->rate_node_tx_priority_set) {
NL_SET_ERR_MSG_ATTR(info->extack,
attrs[DEVLINK_ATTR_RATE_TX_PRIORITY],
"TX priority set isn't supported for the nodes");
return false;
}
if (attrs[DEVLINK_ATTR_RATE_TX_WEIGHT] && !ops->rate_node_tx_weight_set) {
NL_SET_ERR_MSG_ATTR(info->extack,
attrs[DEVLINK_ATTR_RATE_TX_WEIGHT],
"TX weight set isn't supported for the nodes");
return false;
}
} else {
WARN(1, "Unknown type of rate object");
return false;
......@@ -9187,6 +9251,8 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
[DEVLINK_ATTR_LINECARD_INDEX] = { .type = NLA_U32 },
[DEVLINK_ATTR_LINECARD_TYPE] = { .type = NLA_NUL_STRING },
[DEVLINK_ATTR_SELFTESTS] = { .type = NLA_NESTED },
[DEVLINK_ATTR_RATE_TX_PRIORITY] = { .type = NLA_U32 },
[DEVLINK_ATTR_RATE_TX_WEIGHT] = { .type = NLA_U32 },
};
static const struct genl_small_ops devlink_nl_ops[] = {
......@@ -10320,14 +10386,61 @@ void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port, u32 contro
}
EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_sf_set);
/**
* devl_rate_node_create - create devlink rate node
* @devlink: devlink instance
* @priv: driver private data
* @node_name: name of the resulting node
* @parent: parent devlink_rate struct
*
* Create devlink rate object of type node
*/
struct devlink_rate *
devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name,
struct devlink_rate *parent)
{
struct devlink_rate *rate_node;
rate_node = devlink_rate_node_get_by_name(devlink, node_name);
if (!IS_ERR(rate_node))
return ERR_PTR(-EEXIST);
rate_node = kzalloc(sizeof(*rate_node), GFP_KERNEL);
if (!rate_node)
return ERR_PTR(-ENOMEM);
if (parent) {
rate_node->parent = parent;
refcount_inc(&rate_node->parent->refcnt);
}
rate_node->type = DEVLINK_RATE_TYPE_NODE;
rate_node->devlink = devlink;
rate_node->priv = priv;
rate_node->name = kstrdup(node_name, GFP_KERNEL);
if (!rate_node->name) {
kfree(rate_node);
return ERR_PTR(-ENOMEM);
}
refcount_set(&rate_node->refcnt, 1);
list_add(&rate_node->list, &devlink->rate_list);
devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
return rate_node;
}
EXPORT_SYMBOL_GPL(devl_rate_node_create);
/**
* devl_rate_leaf_create - create devlink rate leaf
* @devlink_port: devlink port object to create rate object on
* @priv: driver private data
* @parent: parent devlink_rate struct
*
* Create devlink rate object of type leaf on provided @devlink_port.
*/
int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv)
int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv,
struct devlink_rate *parent)
{
struct devlink *devlink = devlink_port->devlink;
struct devlink_rate *devlink_rate;
......@@ -10341,6 +10454,11 @@ int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv)
if (!devlink_rate)
return -ENOMEM;
if (parent) {
devlink_rate->parent = parent;
refcount_inc(&devlink_rate->parent->refcnt);
}
devlink_rate->type = DEVLINK_RATE_TYPE_LEAF;
devlink_rate->devlink = devlink;
devlink_rate->devlink_port = devlink_port;
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment