- 01 Aug, 2019 39 commits
-
-
Shay Agroskin authored
In MPWQE mode, when transmitting packets with XDP, a packet smaller than a certain size (set to 256 bytes) is sent inline within its WQE TX descriptor (mem-copied), in case the hardware TX queue is congested beyond a pre-defined watermark. If an MPWQE cannot contain an additional inline packet, we close this MPWQE session and send the packet inlined within the next MPWQE. To save some MPWQE session close+open operations, we don't open MPWQE sessions whose contiguous room is smaller than a certain size (set to the HW MPWQE maximum size). If there isn't enough contiguous room in the send queue, we fill it with NOPs and wrap the send queue index around. This way, qualified packets are always sent inline.

Perf tests: packet rate for UDP 64-byte multi-stream over two dual-port ConnectX-5 100Gbps NICs. CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

XDP_TX, with 24 channels:

| ------ | bounced packets | inlined packets | inline ratio |
| before | 113.6Mpps       | 96.3Mpps        | 84%          |
| after  | 115Mpps         | 99.5Mpps        | 86%          |

With one channel:

| ------ | bounced packets | inlined packets | inline ratio |
| before | 6.7Mpps         | 0pps            | 0%           |
| after  | 6.8Mpps         | 0pps            | 0%           |

As we can see, there is an improvement in both inline ratio and overall packet rate for 24 channels, and no degradation in the one-channel case.

Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
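A minimal sketch of the inline decision described above; all identifiers here (xdp_sq_sketch, XDP_INLINE_MAX_LEN, the congestion fields) are illustrative placeholders, not the actual mlx5 driver symbols:

#include <stdbool.h>

#define XDP_INLINE_MAX_LEN  256  /* packets below this size may be inlined */

struct xdp_sq_sketch {
        unsigned int outstanding;  /* descriptors posted but not completed */
        unsigned int watermark;    /* pre-defined congestion threshold     */
};

/* Inline only small packets, and only while the TX queue is congested
 * beyond the watermark; otherwise the packet is sent by reference.
 */
static bool xdp_should_inline(const struct xdp_sq_sketch *sq,
                              unsigned int pkt_len)
{
        return pkt_len < XDP_INLINE_MAX_LEN && sq->outstanding > sq->watermark;
}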
-
Tariq Toukan authored
We use NOPs to populate the WQ fragment edge if the WQE does not fit in the fragment, to avoid WQEs crossing a page boundary (or wrapping around the WQ). The upper bound on the needed number of NOPs is one WQEBB less than the largest possible WQE, for otherwise the WQE would certainly fit. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
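A hedged sketch of the edge-fill bound (not the actual mlx5 helpers): if a WQE of wqe_bbs basic blocks does not fit in the room left before the fragment edge, that room is padded with NOPs, and the worst case is exactly one WQEBB less than the largest WQE:

/* Number of one-WQEBB NOPs needed to reach the fragment edge.
 * If the WQE fits, no padding is needed; otherwise the remaining room
 * (at most wqe_bbs - 1 WQEBBs, else the WQE would have fit) is filled.
 */
static unsigned int nops_to_frag_edge(unsigned int room_bbs,
                                      unsigned int wqe_bbs)
{
        return room_bbs >= wqe_bbs ? 0 : room_bbs;
}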
-
Gavi Teitz authored
Add a pool of flow counters, based on flow counter bulks, removing the need to allocate a new counter via a costly FW command during the flow creation process. The time it takes to acquire/release a flow counter is cut from ~50 [us] to ~50 [ns].

The pool is part of the mlx5 driver instance, and provides flow counters for aging flows. mlx5_fc_create() was modified to provide counters for aging flows from the pool by default, and mlx5_fc_destroy() was modified to release counters back to the pool for later reuse. If bulk allocation is not supported or fails, and for non-aging flows, the fallback behavior is to allocate and free individual counters.

The pool is comprised of three lists of flow counter bulks: one of fully used bulks, one of partially used bulks, and one of unused bulks. Counters are provided from the partially used bulks first, to help limit bulk fragmentation.

The pool maintains a threshold, and strives to keep the amount of available counters below it. The pool is increased in size when a counter acquisition request is made and there are no available counters, and it is decreased in size when the last counter in a bulk is released and there are more available counters than the threshold. All pool size changes are done in the context of the acquiring/releasing process.

The value of the threshold is directly correlated to the amount of used counters the pool is providing, while constrained by a hard maximum, and is recalculated every time a bulk is allocated/freed. This ensures that the pool only consumes large amounts of memory for available counters if the pool is being used heavily. When fully populated and at the hard maximum, the buffer of available counters consumes ~40 [MB].

Signed-off-by: Gavi Teitz <gavi@mellanox.com> Reviewed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
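A rough sketch of the pool layout described above, assuming hypothetical struct and field names (the real mlx5 definitions may differ):

#include <linux/list.h>

struct fc_pool_sketch {
        struct list_head fully_used;     /* bulks with no free counters        */
        struct list_head partially_used; /* preferred source, limits fragments */
        struct list_head unused;         /* completely free bulks              */
        int available_fcs;               /* free counters across all bulks     */
        int used_fcs;                    /* counters handed out to flows       */
        int threshold;                   /* target bound on available_fcs      */
};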
-
Gavi Teitz authored
Add infrastructure to track bulks of flow counters, providing the means to allocate and deallocate bulks, and to acquire and release individual counters from the bulks. Signed-off-by: Gavi Teitz <gavi@mellanox.com> Reviewed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Eli Cohen authored
Use the scheduling elements to implement an ingress rate limiter on an eswitch port's ingress traffic. Since the ingress of an eswitch port is the egress of the VF port, we control eswitch ingress by controlling VF egress.

Configuration is done using the ports' representor net devices. Please note that burst size configuration is not supported by ConnectX-5 and earlier generations of devices.

Configuration examples:

tc:
tc filter add dev enp59s0f0_0 root protocol ip matchall action police rate 1mbit burst 20k

ovs:
ovs-vsctl set interface eth0 ingress_policing_rate=1000

Signed-off-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed authored
Misc updates from mlx5-next branch.

1) Eli improves the handling of the support for QoS element type
2) Gavi refactors and prepares mlx5 flow counters for bulk allocation support
3) Parav refactors and improves E-Switch load/unload flows
4) Saeed, two misc cleanups

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Parav Pandit authored
Currently for PF and ECPF vports, representors are created before their eswitch hardware ports are initialized, in the below flow:

mlx5_eswitch_enable()
  esw_offloads_init()
    esw_offloads_load_all_reps()
[..]
esw_enable_vport()

However for VFs, vports are initialized before creating their respective netdev representors, in event handling context.

Similarly, while disabling the eswitch, hardware vports are disabled first, followed by destroying their representors. Here the underlying vport gets destroyed, but its respective user-facing netdevice can still exist, on which the user can continue to perform more offload operations.

Instead, it is more accurate to do:

enable_eswitch switchdev mode:
1. perform FDB tables initialization
2. initialize hw vport
3. create and publish representor for this vport

disable_eswitch switchdev mode:
1. destroy user facing representor for the vport
2. disable hw vport
3. perform FDB tables cleanup

Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Parav Pandit authored
The mc_promisc pointer points to an instance of struct esw_mc_addr allocated as part of the esw structure, hence it cannot be NULL. Remove the redundant check and assign the pointer where it is actually used. While at it, add a comment around the legacy mode fields and move mc_promisc close to the other legacy mode structures to improve code readability. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Saeed Mahameed authored
We don't need to handle the error flow of esw_create_legacy_table() in the same branch; it is already handled directly after the if statement, for both legacy and switchdev modes, in one place. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Parav Pandit authored
vports need to be enabled in both switchdev and legacy modes. In switchdev mode, vports should be enabled after initializing the FDB tables and before creating their representors, so that a representor works on an initialized vport object. Prepare a helper function which can be called when enabling either of the eswitch modes. Similarly, add a helper function to disable vports. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Parav Pandit authored
First enable the TSAR QoS hardware block in the device, before enabling its user vports. This refactor is needed so that vports can be enabled before their representor netdevices are created. While at it: esw_create_tsar() returns an error code which was used only to print an error, but esw_create_tsar() already prints a warning when it hits an error. Hence, remove the redundant warning. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Parav Pandit authored
Except for the bit toggling, the code to enable and disable metadata passing is the same. Hence, combine it into a single function controlled by an enable flag. Also, instead of checking whether metadata is supported in multiple places, fold the check into the helper function. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
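A hedged sketch of the fold described above, with a single helper controlled by an enable flag; all names and the flag bit are hypothetical, not the actual eswitch code:

#include <stdbool.h>

#define VPORT_METADATA_PASSING 0x1 /* hypothetical metadata-passing bit */

struct vport_ctx_sketch {
        unsigned int flags;
};

/* Only the bit toggling differs between enable and disable; the rest of
 * the flow (capability check, writing the context back) is shared.
 */
static void vport_metadata_set_sketch(struct vport_ctx_sketch *ctx, bool enable)
{
        if (enable)
                ctx->flags |= VPORT_METADATA_PASSING;
        else
                ctx->flags &= ~VPORT_METADATA_PASSING;
}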
-
Eli Cohen authored
Check if the firmware supports the requested element type before attempting to create an element of that type. In addition, explicitly specify the requested element type and TSAR type. Signed-off-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Parav Pandit authored
Currently mlx5_load_one() performs device registration using mlx5_register_device(), but mlx5_unload_one() doesn't unregister. Make them symmetric by doing device unregistration in mlx5_unload_one(). Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Saeed Mahameed authored
The first reserved field is off by one: instead of reserved_at_1 it should be reserved_at_2. Fix that. Fixes: a12ff35e ("net/mlx5: Introduce TLS TX offload hardware bits and structures") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
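For context, reserved fields in mlx5's interface layouts are named after the bit offset at which they start; the layout below is a hypothetical illustration (not the actual TLS capability struct) of why a field that starts after two 1-bit fields must be reserved_at_2, not reserved_at_1:

#include <linux/types.h>

struct example_cap_bits {
        u8 cap_a[0x1];          /* bit 0 */
        u8 cap_b[0x1];          /* bit 1 */
        u8 reserved_at_2[0x1e]; /* bits 2..31 */
};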
-
Gavi Teitz authored
Add a handle to invoke the new FW capability of allocating a bulk of flow counters. Signed-off-by: Gavi Teitz <gavi@mellanox.com> Reviewed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Gavi Teitz authored
Towards introducing the ability to allocate bulks of flow counters, refactor the flow counter bulk query process. Remove functions and structs whose names indicated that they were used for flow counter bulk allocation FW commands, despite actually only being used to support bulk querying, and migrate their functionality to correctly named functions in their natural location, fs_counters.c.

Additionally, optimize the bulk query process by:
* Extracting the memory used for the query to mlx5_fc_stats so that it is only allocated once, and not for each bulk query.
* Querying all the counters in one function call.

Signed-off-by: Gavi Teitz <gavi@mellanox.com> Reviewed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
David S. Miller authored
Jeff Kirsher says:

====================
100GbE Intel Wired LAN Driver Updates 2019-07-31

This series contains updates to the ice driver only.

Paul adds support for reporting what the link partner is advertising for flow control settings.

Jake fixes the hardware statistics registers, which are prone to rollover since they are either 32 or 40 bits wide depending on which register is being read, by using a 64-bit software statistic to store off the hardware statistics and track past rollover. Also fixes an issue with the locking of the control queue, where locks were being destroyed at run time.

Tony fixes an issue that was created when interrupt tracking was refactored and the call to ice_vsi_setup_vector_base() was removed from the PF VSI instead of the VF VSI. Adds a check before trying to configure a port to ensure that media is attached.

Brett fixes an issue in the receive queue configuration where prefena (Prefetch Enable) was being set to 0, which caused the hardware to only fetch descriptors when there are none free in the cache for a received packet. Updates the driver to only bump the receive tail once per napi_poll call, instead of the current model of bumping the tail up to 4 times per napi_poll call. Adds statistics for receive drops at the port level to ethtool/netlink. Cleans up duplicate code in the receive buffer allocation code.

Akeem updates the driver to ensure that VFs stay disabled until the setup or reset is completed. Modifies the driver to use the allocated number of transmit queues per VSI to set up the scheduling tree, versus using the total number of available transmit queues. Also fixes the driver to update the total number of configured queues after a successful VF request to change its number of queues, before updating the corresponding VSI for that VF. Cleans up unnecessary flags that are no longer needed.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Huazhong Tan says:

====================
net: hns3: some code optimizations & bugfixes & features

This patch-set includes code optimizations, bugfixes and features for the HNS3 ethernet controller driver.

[patch 01/12] adds support for reporting link change event.
[patch 02/12] adds handler for NCSI error.
[patch 03/12] fixes bug related to debugfs.
[patch 04/12] adds a code optimization for setting ring parameters.
[patch 05/12 - 09/12] adds some cleanups.
[patch 10/12 - 12/12] adds some patches related to reset issue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Huazhong Tan authored
When calling hclge_reset_event() within HCLGE_RESET_INTERVAL, it returns directly now. If no one calls it again, then the error which needs a reset to be fixed cannot be fixed. So this patch activates the reset timer for this case, and adds a check at the end of the reset procedure so that this error gets fixed earlier. Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Reviewed-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Huazhong Tan authored
Currently, the reset interrupt is cleared in the reset task, which is too late. When the hardware finishes the previous reset, it can begin a new global/IMP reset; if this new reset is of the same type as the previous one, the driver will clear both together and cannot tell that there is another reset, while the hardware still waits for the driver to deal with the second one.

So this patch clears the PF's reset interrupt status in hclge_irq_handle(); the hardware waits for handshaking from the driver before doing a reset, so the driver and hardware deal with resets one by one.

BTW, when the VF is doing a global/IMP reset, it reads the PF's reset interrupt register to find out whether the PF driver's re-initialization is done, since the VF's re-initialization should be done after the PF's. So we add a new command and a register bit for that: when the VF receives the reset interrupt, it sets this bit; when the PF finishes re-initialization, it sends a command to clear this bit, and then the VF does its re-initialization.

Fixes: 4ed340ab ("net: hns3: Add reset process in hclge_main") Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Huazhong Tan authored
Currently, the driver sets a handshake status to tell the hardware that the driver has downed the netdev and the hardware can continue with the reset process. The driver clears this handshake status when re-initializing the CMDQ, and does not recover the status when the reset fails, which may cause the hardware to wait for the handshake status to be set and not be able to continue with the reset process. So this patch delays clearing the handshake status until just before UP, and recovers the status when the reset fails.

BTW, this patch adds a new function hclge(vf)_reset_handshake() to deal with the reset handshake issue, and renames HCLGE(VF)_NIC_CMQ_ENABLE to HCLGE(VF)_NIC_SW_RST_RDY, which represents this register bit more accurately.

Fixes: ada13ee3 ("net: hns3: add handshake with hardware while doing reset") Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Reviewed-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guojia Liao authored
The member 'mac_add' defined in hclge_mac_ethertype_idx_rd_cmd holds a MAC address, so 'mac_addr' is a better name for it. Signed-off-by: Guojia Liao <liaoguojia@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Weihang Li authored
The 4th and 5th parameters of hclge_cmd_query_error are useless, so this patch removes them. Signed-off-by: Weihang Li <liweihang@hisilicon.com> Reviewed-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yunsheng Lin authored
When hclge_tm_schd_info_update calls hclge_tm_schd_info_init to initialize the schedule info, hdev->tm_info.num_pg and hdev->tx_sch_mode are not changed, which makes the checking in hclge_tm_schd_info_init unnecessary. So this patch moves the hdev->tm_info.num_pg and hdev->tx_sch_mode checking into hclge_tm_schd_init and changes the return type of hclge_tm_schd_info_init from int to void. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yunsheng Lin authored
The unused_count variable is used to indicate how many RX BDs need a new buffer attached in hns3_clean_rx_ring, and the clean_count variable has a similar meaning. This patch removes the clean_count variable and uses unused_count to uniformly indicate the RX BDs that need new buffers attached. This patch also cleans up some coding style related to variable assignment in hns3_clean_rx_ring. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jian Shen authored
The local variable return_status in hclge_get_mac_val_cmd_status() is useless. So this patch returns the error code directly, instead of using this variable. Also, replace some '%d' with '%u' in hclge_get_mac_val_cmd_status(). Signed-off-by: Jian Shen <shenjian15@huawei.com> Reviewed-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jian Shen authored
Previously, when changing the ring parameters, we freed the old ring resources first, and then set up the new ring resources. In case of a memory allocation failure, there would be no resources left to use. This patch refines it by setting up the new ring resources first and freeing the old ring resources afterwards. Also reduce the maximum ring BD number to 32760 according to the UM. Signed-off-by: Jian Shen <shenjian15@huawei.com> Reviewed-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
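A minimal user-space sketch of the reordered flow (allocate the new ring before freeing the old one, so a failed allocation leaves the working ring untouched); all names and the BD size are assumptions, not hns3 code:

#include <stdlib.h>

struct ring_sketch {
        unsigned int bd_num;
        void *bds;
};

static struct ring_sketch *ring_alloc_sketch(unsigned int bd_num)
{
        struct ring_sketch *r = calloc(1, sizeof(*r));

        if (!r)
                return NULL;
        r->bds = calloc(bd_num, 32); /* assumed BD size, illustration only */
        if (!r->bds) {
                free(r);
                return NULL;
        }
        r->bd_num = bd_num;
        return r;
}

static void ring_free_sketch(struct ring_sketch *r)
{
        free(r->bds);
        free(r);
}

/* Set up the new ring before freeing the old one, so an allocation
 * failure leaves the previously working ring in place.
 */
static int ring_change_params_sketch(struct ring_sketch **active,
                                     unsigned int new_bd_num)
{
        struct ring_sketch *new_ring = ring_alloc_sketch(new_bd_num);

        if (!new_ring)
                return -1;
        ring_free_sketch(*active);
        *active = new_ring;
        return 0;
}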
-
Yufeng Mo authored
Some commands are not supported on ports that do not support DCB. This patch distinguishes these commands and does not query unsupported commands in debugfs. This patch also fixes an error in the dump "qos buf cfg" command in debugfs. Fixes: 2849d4e7 ("net: hns3: Add "tc config" info query function") Fixes: 7d9d7f88 ("net: hns3: Add "qos buffer" config info query function") Signed-off-by: Yufeng Mo <moyufeng@huawei.com> Reviewed-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Huazhong Tan authored
When NCSI has a HW error, the IMP will report this error to the driver by sending a mailbox message. After receiving this message, the driver should assert a global reset to fix this kind of HW error. Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Reviewed-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jian Shen authored
Previously, the PF updated the link status once per second. For some scenarios, the link down event needs to be reported more quickly. To solve this, the firmware pushes the link change event to the PF with a CMDQ message, and the driver updates the link status directly. Signed-off-by: Jian Shen <shenjian15@huawei.com> Reviewed-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YueHaibing authored
Use devm_platform_ioremap_resource() to simplify the code a bit. This is detected by coccinelle. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
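The conversion this patch (and the following identical ones) applies follows the usual pattern shown below; the probe function is a generic illustration, not code taken from any of the converted drivers:

#include <linux/err.h>
#include <linux/io.h>
#include <linux/platform_device.h>

static int example_probe(struct platform_device *pdev)
{
        void __iomem *base;

        /* Before:
         *      struct resource *res;
         *
         *      res  = platform_get_resource(pdev, IORESOURCE_MEM, 0);
         *      base = devm_ioremap_resource(&pdev->dev, res);
         *
         * After: one helper fetches the resource and ioremaps it.
         */
        base = devm_platform_ioremap_resource(pdev, 0);
        if (IS_ERR(base))
                return PTR_ERR(base);

        return 0;
}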
-
YueHaibing authored
Use devm_platform_ioremap_resource() to simplify the code a bit. This is detected by coccinelle. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YueHaibing authored
Use devm_platform_ioremap_resource() to simplify the code a bit. This is detected by coccinelle. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YueHaibing authored
Use devm_platform_ioremap_resource() to simplify the code a bit. This is detected by coccinelle. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YueHaibing authored
Use devm_platform_ioremap_resource() to simplify the code a bit. This is detected by coccinelle. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YueHaibing authored
Use devm_platform_ioremap_resource() to simplify the code a bit. This is detected by coccinelle. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YueHaibing authored
Use devm_platform_ioremap_resource() to simplify the code a bit. This is detected by coccinelle. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YueHaibing authored
Use devm_platform_ioremap_resource() to simplify the code a bit. This is detected by coccinelle. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 31 Jul, 2019 1 commit
-
-
Nikolay Aleksandrov authored
In user-space there's no way to distinguish why an mdb entry was deleted, and that is a problem for daemons which would like to keep the mdb in sync with remote ends (e.g. mlag) but would also like to converge faster. In almost all cases we'd like to age out the remote entry for performance and convergence reasons, except when fast-leave is enabled. In that case we want an explicit, immediate remote delete, thus add an mdb flag which is set only when the entry is being deleted due to fast-leave. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-