Commits · e1b5683ff62e7b328317aec08869495992053e9d · Kirill Smelkov / linux

25 Aug, 2021 30 commits

net: mana: Move NAPI from EQ to CQ · e1b5683f

Haiyang Zhang authored Aug 24, 2021

The existing code has NAPI threads polling on EQ directly. To prepare
for EQ sharing among vPorts, move NAPI from EQ to CQ so that one EQ
can serve multiple CQs from different vPorts.

The "arm bit" is only set when CQ processing is completed to reduce
the number of EQ entries, which in turn reduce the number of interrupts
on EQ.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1b5683f

netxen_nic: Remove the repeated declaration · 807d1032

Shaokun Zhang authored Aug 25, 2021

Function 'netxen_rom_fast_read' is declared twice, so remove the
repeated declaration.

Cc: Manish Chopra <manishc@marvell.com>
Cc: Rahul Verma <rahulv@marvell.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

807d1032

cxgb4: Properly revert VPD changes · bc4f128d

Nathan Chancellor authored Aug 24, 2021

Clang warns:

drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2785:2: error: variable 'kw_offset' is uninitialized when used here [-Werror,-Wuninitialized]
        FIND_VPD_KW(i, "RV");
        ^~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2776:39: note: expanded from macro 'FIND_VPD_KW'
        var = pci_vpd_find_info_keyword(vpd, kw_offset, vpdr_len, name); \
                                             ^~~~~~~~~
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2748:34: note: initialize the variable 'kw_offset' to silence this warning
        unsigned int vpdr_len, kw_offset, id_len;
                                        ^
                                         = 0
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2785:2: error: variable 'vpdr_len' is uninitialized when used here [-Werror,-Wuninitialized]
        FIND_VPD_KW(i, "RV");
        ^~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2776:50: note: expanded from macro 'FIND_VPD_KW'
        var = pci_vpd_find_info_keyword(vpd, kw_offset, vpdr_len, name); \
                                                        ^~~~~~~~
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2748:23: note: initialize the variable 'vpdr_len' to silence this warning
        unsigned int vpdr_len, kw_offset, id_len;
                             ^
                              = 0
2 errors generated.

The series "PCI/VPD: Convert more users to the new VPD API functions"
was applied to net-next when it should have been applied to the PCI tree
because of build errors. However, commit 82e34c8a ("Revert "Revert
"cxgb4: Search VPD with pci_vpd_find_ro_info_keyword()""") reapplied a
change, resulting in the warning above.

Properly revert commit 8d63ee60 ("cxgb4: Search VPD with
pci_vpd_find_ro_info_keyword()") to fix the warning and restore proper
functionality. This also reverts commit 3a93bede ("cxgb4: Remove
unused vpd_param member ec") to avoid future merge conflicts, as that
change has been applied to the PCI tree.

Link: https://lore.kernel.org/r/20210823120929.7c6f7a4f@canb.auug.org.au/
Link: https://lore.kernel.org/r/1ca29408-7bc7-4da5-59c7-87893c9e0442@gmail.com/Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

bc4f128d

Merge branch 'mptcp-next' · cb0f8b03

David S. Miller authored Aug 25, 2021

Mat Martineau says:

====================
mptcp: Optimize output options and add MP_FAIL

This patch set contains two groups of changes that we've been testing in
the MPTCP tree.

The first optimizes the code path and data structure for populating
MPTCP option headers when transmitting.

Patch 1 reorganizes code to reduce the number of conditionals that need
to be evaluated in common cases.

Patch 2 rearranges struct mptcp_out_options to save 80 bytes (on x86_64).

The next five patches add partial support for the MP_FAIL option as
defined in RFC 8684. MP_FAIL is an option header used to cleanly handle
MPTCP checksum failures. When the MPTCP checksum detects an error in the
MPTCP DSS header or the data mapped by that header, the receiver uses a
TCP RST with MP_FAIL to close the subflow that experienced the error and
provide associated MPTCP sequence number information to the peer. RFC
8684 also describes how a single-subflow connection can discard corrupt
data and remain connected under certain conditions using MP_FAIL, but
that feature is not implemented here.

Patches 3-5 implement MP_FAIL transmit and receive, and integrates with
checksum validation.

Patches 6 & 7 add MP_FAIL selftests and the MIBs required for those
tests.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

cb0f8b03

selftests: mptcp: add MP_FAIL mibs check · 6bb3ab49

Geliang Tang authored Aug 24, 2021

This patch added a function chk_fail_nr to check the mibs for MP_FAIL.
Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6bb3ab49

mptcp: add the mibs for MP_FAIL · eb7f3365

Geliang Tang authored Aug 24, 2021

This patch added the mibs for MP_FAIL: MPTCP_MIB_MPFAILTX and
MPTCP_MIB_MPFAILRX.
Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eb7f3365

mptcp: send out MP_FAIL when data checksum fails · 478d7700

Geliang Tang authored Aug 24, 2021

When a bad checksum is detected, set the send_mp_fail flag to send out
the MP_FAIL option.

Add a new function mptcp_has_another_subflow() to check whether there's
only a single subflow.

When multiple subflows are in use, close the affected subflow with a RST
that includes an MP_FAIL option and discard the data with the bad
checksum.

Set the sk_state of the subsocket to TCP_CLOSE, then the flag
MPTCP_WORK_CLOSE_SUBFLOW will be set in subflow_sched_work_if_closed,
and the subflow will be closed.

When a single subfow is in use, temporarily handled by sending MP_FAIL
with a RST too.
Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

478d7700

mptcp: MP_FAIL suboption receiving · 5580d41b

Geliang Tang authored Aug 24, 2021

This patch added handling for receiving MP_FAIL suboption.

Add a new members mp_fail and fail_seq in struct mptcp_options_received.
When MP_FAIL suboption is received, set mp_fail to 1 and save the sequence
number to fail_seq.

Then invoke mptcp_pm_mp_fail_received to deal with the MP_FAIL suboption.
Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5580d41b

mptcp: MP_FAIL suboption sending · c25aeb4e

Geliang Tang authored Aug 24, 2021

This patch added the MP_FAIL suboption sending support.

Add a new flag named send_mp_fail in struct mptcp_subflow_context. If
this flag is set, send out MP_FAIL suboption.

Add a new member fail_seq in struct mptcp_out_options to save the data
sequence number to put into the MP_FAIL suboption.

An MP_FAIL option could be included in a RST or on the subflow-level
ACK.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c25aeb4e

mptcp: shrink mptcp_out_options struct · d7b26908

Paolo Abeni authored Aug 24, 2021

After the previous patch we can alias with a union several
fields in mptcp_out_options. Such struct is stack allocated and
memset() for each plain TCP out packet. Every saved byted counts.

Before:
pahole -EC mptcp_out_options
 # ...
/* size: 136, cachelines: 3, members: 17 */

After:
pahole -EC mptcp_out_options
 # ...
/* size: 56, cachelines: 1, members: 9 */
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d7b26908

mptcp: optimize out option generation · 1bff1e43

Paolo Abeni authored Aug 24, 2021

Currently we have several protocol constraints on MPTCP options
generation (e.g. MPC and MPJ subopt are mutually exclusive)
and some additional ones required by our implementation
(e.g. almost all ADD_ADDR variant are mutually exclusive with
everything else).

We can leverage the above to optimize the out option generation:
we check DSS/MPC/MPJ presence in a mutually exclusive way,
avoiding many unneeded conditionals in the common cases.

Additionally extend the existing constraints on ADD_ADDR opt on
all subvariants, so that it becomes fully mutually exclusive with
the above and we can skip another conditional statement for the
common case.

This change is also needed by the next patch.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1bff1e43

Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · d484dc2b

David S. Miller authored Aug 25, 2021

Tony Nguyen says:

====================
1GbE Intel Wired LAN Driver Updates 2021-08-24

Vinicius Costa Gomes says:

This adds support for PCIe PTM (Precision Time Measurement) to the igc
driver. PCIe PTM allows the NIC and Host clocks to be compared more
precisely, improving the clock synchronization accuracy.

Patch 1/4 reverts a commit that made pci_enable_ptm() private to the
PCI subsystem, reverting makes it possible for it to be called from
the drivers.

Patch 2/4 adds the pcie_ptm_enabled() helper.

Patch 3/4 calls pci_enable_ptm() from the igc driver.

Patch 4/4 implements the PCIe PTM support. Exposing it via the
.getcrosststamp() API implies that the time measurements are made
synchronously with the ioctl(). The hardware was implemented so the
most convenient way to retrieve that information would be
asynchronously. So, to follow the expectations of the ioctl() we have
to use less convenient ways, triggering an PCIe PTM dialog every time
a ioctl() is received.

Some questions are raised (also pointed out in the commit message):

1. Using convert_art_ns_to_tsc() is too x86 specific, there should be
   a common way to create a 'system_counterval_t' from a timestamp.

2. convert_art_ns_to_tsc() says that it should only be used when
   X86_FEATURE_TSC_KNOWN_FREQ is true, but during tests it works even
   when it returns false. Should that check be done?
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d484dc2b

Merge branch 'lan7800-improvements' · 38cbd6e7

David S. Miller authored Aug 25, 2021

John Efstathiades says:

====================
LAN7800 driver improvements

This patch set introduces a number of improvements and fixes for
problems found during testing of a modification to add a NAPI-style
approach to packet handling to improve performance.

NOTE: the NAPI changes are not part of this patch set and the issues
      fixed by this patch set are not coupled to the NAPI changes.

Patch 1 fixes white space and style issues

Patch 2 removes an unused timer

Patch 3 introduces macros to set the internal packet FIFO flow
control levels, which makes it easier to update the levels in future.

Patch 4 removes an unused queue

Patch 5 (updated for v2) introduces function return value checks and
error propagation to various parts of the driver where a return
code was captured but then ignored.

This patch is completely different to patch 5 in version 1 of this patch
set. The changes in the v1 patch 5 are being set aside for the time
being.

Patch 6 updates the LAN7800 MAC reset code to ensure there is no
PHY register access in progress when the MAC is reset. This change
prevents a kernel exception that can otherwise occur.

Patch 7 fixes problems with system suspend and resume handling while
the device is transmitting and receiving data.

Patch 8 fixes problems with auto-suspend and resume handling and
depends on changes introduced by patch 7.

Patch 9 fixes problems with device disconnect handling that can result
in kernel exceptions and/or hang.

Patch 10 limits the rate at which driver warning messages are emitted.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

38cbd6e7

lan78xx: Limit number of driver warning messages · df0d6f7a

John Efstathiades authored Aug 24, 2021

Device removal can result in a large burst of driver warning messages
(20 - 30) sent to the kernel log. Most of these are register read/write
failures.

This change limits the rate at which these messages are emitted.
Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

df0d6f7a

lan78xx: Fix race condition in disconnect handling · 77dfff5b

John Efstathiades authored Aug 24, 2021

If there is a device disconnect at roughly the same time as a
deferred PHY link reset there is a race condition that can result
in a kernel lock up due to a null pointer dereference in the
driver's deferred work handling routine lan78xx_delayedwork().
The following changes fix this problem.

Add new status flag EVENT_DEV_DISCONNECT to indicate when the
device has been removed and use it to prevent operations, such as
register access, that will fail once the device is removed.

Stop processing of deferred work items when the driver's USB
disconnect handler is invoked.

Disconnect the PHY only after the network device has been
unregistered and all delayed work has been cancelled.
Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

77dfff5b

lan78xx: Fix race conditions in suspend/resume handling · 5f4cc6e2

John Efstathiades authored Aug 24, 2021

If the interface is given an IP address while the device is
suspended (as a result of an auto-suspend event) there is a race
between lan78xx_resume() and lan78xx_open() that can result in an
exception or failure to handle incoming packets. The following
changes fix this problem.

Introduce a mutex to serialise operations in the network interface
open and stop entry points with respect to the USB driver suspend
and resume entry points.

Move Tx and Rx data path start/stop to lan78xx_start() and
lan78xx_stop() respectively and flush the packet FIFOs before
starting the Tx and Rx data paths. This prevents the MAC and FIFOs
getting out of step and delivery of malformed packets to the network
stack.

Stop processing of received packets before disconnecting the
PHY from the MAC to prevent a kernel exception caused by handling
packets after the PHY device has been removed.

Refactor device auto-suspend code to make it consistent with the
the system suspend code and make the suspend handler easier to read.

Add new code to stop wake-on-lan packets or PHY events resuming the
host or device from suspend if the device has not been opened
(typically after an IP address is assigned).

This patch is dependent on changes to lan78xx_suspend() and
lan78xx_resume() introduced in the previous patch of this patch set.
Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5f4cc6e2

lan78xx: Fix partial packet errors on suspend/resume · e1210fe6

John Efstathiades authored Aug 24, 2021

The MAC can get out of step with the internal packet FIFOs if the
system goes to sleep when the link is active, especially at high
data rates. This can result in partial frames in the packet FIFOs
that in result in malformed frames being delivered to the host.
This occurs because the driver does not enable/disable the internal
packet FIFOs in step with the corresponding MAC data path. The
following changes fix this problem.

Update code that enables/disables the MAC receiver and transmitter
to the more general Rx and Tx data path, where the data path in each
direction consists of both the MAC function (Tx or Rx) and the
corresponding packet FIFO.

In the receive path the packet FIFO must be enabled before the MAC
receiver but disabled after the MAC receiver.

In the transmit path the opposite is true: the packet FIFO must be
enabled after the MAC transmitter but disabled before the MAC
transmitter.

The packet FIFOs can be flushed safely once the corresponding data
path is stopped.
Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1210fe6

lan78xx: Fix exception on link speed change · b1f6696d

John Efstathiades authored Aug 24, 2021

An exception is sometimes seen when the link speed is changed
from auto-negotiation to a fixed speed, or vice versa. The
exception occurs when the MAC is reset (due to the link speed
change) at the same time as the PHY state machine is accessing
a PHY register. The following changes fix this problem.

Rework the MAC reset to ensure there is no outstanding MDIO
register transaction before the reset and then wait until the
reset is complete before allowing any further MAC register access.
Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b1f6696d

lan78xx: Add missing return code checks · 3415f6ba

John Efstathiades authored Aug 24, 2021

There are many places in the driver where the return code from a
function call is captured but without a subsequent test of the
return code and appropriate action taken.

This patch adds the missing return code tests and action. In most
cases the action is an early exit from the calling function.

The function lan78xx_set_suspend() was also updated to make it
consistent with lan78xx_suspend().
Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3415f6ba

lan78xx: Remove unused pause frame queue · 40b8452f

John Efstathiades authored Aug 24, 2021

Remove the pause frame queue from the driver. It is initialised
but not actually used.
Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

40b8452f

lan78xx: Set flow control threshold to prevent packet loss · dc35f854

John Efstathiades authored Aug 24, 2021

Set threshold at which flow control is triggered to 3/4 full of
the internal Rx packet FIFO to prevent packet drops at high data
rates. The new setting reduces the number of dropped UDP frames
and TCP retransmit requests especially on less capable CPUs.
Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dc35f854

lan78xx: Remove unused timer · 3bef6b9e

John Efstathiades authored Aug 24, 2021

Remove kernel timer that is not used by the driver.
Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3bef6b9e

lan78xx: Fix white space and style issues · 9ceec7d3

John Efstathiades authored Aug 24, 2021

Fix white space and code style issues identified by checkpatch.
Signed-off-by: John Efstathiades <john.efstathiades@pebblebay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9ceec7d3

Merge branch 'xen-harden-netfront' · fbd029df

David S. Miller authored Aug 25, 2021

Juergen Gross says:

====================
xen: harden netfront against malicious backends

Xen backends of para-virtualized devices can live in dom0 kernel, dom0
user land, or in a driver domain. This means that a backend might
reside in a less trusted environment than the Xen core components, so
a backend should not be able to do harm to a Xen guest (it can still
mess up I/O data, but it shouldn't be able to e.g. crash a guest by
other means or cause a privilege escalation in the guest).

Unfortunately netfront in the Linux kernel is fully trusting its
backend. This series is fixing netfront in this regard.

It was discussed to handle this as a security problem, but the topic
was discussed in public before, so it isn't a real secret.

It should be mentioned that a similar series has been posted some years
ago by Marek Marczykowski-Górecki, but this series has not been applied
due to a Xen header not having been available in the Xen git repo at
that time. Additionally my series is fixing some more DoS cases.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

fbd029df

xen/netfront: don't trust the backend response data blindly · a884daa6

Juergen Gross authored Aug 24, 2021

Today netfront will trust the backend to send only sane response data.
In order to avoid privilege escalations or crashes in case of malicious
backends verify the data to be within expected limits. Especially make
sure that the response always references an outstanding request.

Note that only the tx queue needs special id handling, as for the rx
queue the id is equal to the index in the ring page.

Introduce a new indicator for the device whether it is broken and let
the device stop working when it is set. Set this indicator in case the
backend sets any weird data.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a884daa6

xen/netfront: disentangle tx_skb_freelist · 21631d2d

Juergen Gross authored Aug 24, 2021

The tx_skb_freelist elements are in a single linked list with the
request id used as link reference. The per element link field is in a
union with the skb pointer of an in use request.

Move the link reference out of the union in order to enable a later
reuse of it for requests which need a populated skb pointer.

Rename add_id_to_freelist() and get_id_from_freelist() to
add_id_to_list() and get_id_from_list() in order to prepare using
those for other lists as well. Define ~0 as value to indicate the end
of a list and place that value into the link for a request not being
on the list.

When freeing a skb zero the skb pointer in the request. Use a NULL
value of the skb pointer instead of skb_entry_is_link() for deciding
whether a request has a skb linked to it.

Remove skb_entry_set_link() and open code it instead as it is really
trivial now.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

21631d2d

xen/netfront: don't read data from request on the ring page · 162081ec

Juergen Gross authored Aug 24, 2021

In order to avoid a malicious backend being able to influence the local
processing of a request build the request locally first and then copy
it to the ring page. Any reading from the request influencing the
processing in the frontend needs to be done on the local instance.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

162081ec

xen/netfront: read response from backend only once · 8446066b

Juergen Gross authored Aug 24, 2021

In order to avoid problems in case the backend is modifying a response
on the ring page while the frontend has already seen it, just read the
response into a local buffer in one go and then operate on that buffer
only.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8446066b

qed: Enable automatic recovery on error condition. · 755f9053

Alok Prasad authored Aug 24, 2021

This patch enables automatic recovery by default in case of various
error condition like fw assert , hardware error etc.
This also ensure driver can handle multiple iteration of assertion
conditions.
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Shai Malin <smalin@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Alok Prasad <palok@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

755f9053

net-next: When a bond have a massive amount of VLANs with IPv6 addresses,... · 406f42fa

Gilad Naaman authored Aug 19, 2021

net-next: When a bond have a massive amount of VLANs with IPv6 addresses, performance of changing link state, attaching a VRF, changing an IPv6 address, etc. go down dramtically.

The source of most of the slow down is the `dev_addr_lists.c` module,
which mainatins a linked list of HW addresses.
When using IPv6, this list grows for each IPv6 address added on a
VLAN, since each IPv6 address has a multicast HW address associated with
it.

When performing any modification to the involved links, this list is
traversed many times, often for nothing, all while holding the RTNL
lock.

Instead, this patch adds an auxilliary rbtree which cuts down
traversal time significantly.

Performance can be seen with the following script:

	#!/bin/bash
	ip netns del test || true 2>/dev/null
	ip netns add test

	echo 1 | ip netns exec test tee /proc/sys/net/ipv6/conf/all/keep_addr_on_down > /dev/null

	set -e

	ip -n test link add foo type veth peer name bar
	ip -n test link add b1 type bond
	ip -n test link add florp type vrf table 10

	ip -n test link set bar master b1
	ip -n test link set foo up
	ip -n test link set bar up
	ip -n test link set b1 up
	ip -n test link set florp up

	VLAN_COUNT=1500
	BASE_DEV=b1

	echo Creating vlans
	ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
	do ip -n test link add link $BASE_DEV name foo.\$i type vlan id \$i; done"

	echo Bringing them up
	ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
	do ip -n test link set foo.\$i up; done"

	echo Assiging IPv6 Addresses
	ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
	do ip -n test address add dev foo.\$i 2000::\$i/64; done"

	echo Attaching to VRF
	ip netns exec test time -p bash -c "for i in \$(seq 1 $VLAN_COUNT);
	do ip -n test link set foo.\$i master florp; done"

On an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz machine, the performance
before the patch is (truncated):

	Creating vlans
	real 108.35
	Bringing them up
	real 4.96
	Assiging IPv6 Addresses
	real 19.22
	Attaching to VRF
	real 458.84

After the patch:

	Creating vlans
	real 5.59
	Bringing them up
	real 5.07
	Assiging IPv6 Addresses
	real 5.64
	Attaching to VRF
	real 25.37

Cc: David S. Miller <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Lu Wei <luwei32@huawei.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

406f42fa

24 Aug, 2021 10 commits

net: bridge: change return type of br_handle_ingress_vlan_tunnel · a37c5c26

Kangmin Park authored Aug 23, 2021

br_handle_ingress_vlan_tunnel() is only referenced in
br_handle_frame(). If br_handle_ingress_vlan_tunnel() is called and
return non-zero value, goto drop in br_handle_frame().

But, br_handle_ingress_vlan_tunnel() always return 0. So, the
routines that check the return value and goto drop has no meaning.

Therefore, change return type of br_handle_ingress_vlan_tunnel() to
void and remove if statement of br_handle_frame().
Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210823102118.17966-1-l4stpr0gr4m@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a37c5c26

selftests/net: Use kselftest skip code for skipped tests · 7844ec21

Po-Hsu Lin authored Aug 23, 2021

There are several test cases in the net directory are still using
exit 0 or exit 1 when they need to be skipped. Use kselftest
framework skip code instead so it can help us to distinguish the
return status.

Criterion to filter out what should be fixed in net directory:
  grep -r "exit [01]" -B1 | grep -i skip

This change might cause some false-positives if people are running
these test scripts directly and only checking their return codes,
which will change from 0 to 4. However I think the impact should be
small as most of our scripts here are already using this skip code.
And there will be no such issue if running them with the kselftest
framework.
Signed-off-by: Po-Hsu Lin <po-hsu.lin@canonical.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20210823085854.40216-1-po-hsu.lin@canonical.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

7844ec21

igc: Add support for PTP getcrosststamp() · a90ec848

Vinicius Costa Gomes authored Jul 26, 2021

i225 supports PCIe Precision Time Measurement (PTM), allowing us to
support the PTP_SYS_OFFSET_PRECISE ioctl() in the driver via the
getcrosststamp() function.

The easiest way to expose the PTM registers would be to configure the PTM
dialogs to run periodically, but the PTP_SYS_OFFSET_PRECISE ioctl()
semantics are more aligned to using a kind of "one-shot" way of retrieving
the PTM timestamps. But this causes a bit more code to be written: the
trigger registers for the PTM dialogs are not cleared automatically.

i225 can be configured to send "fake" packets with the PTM
information, adding support for handling these types of packets is
left for the future.

PTM improves the accuracy of time synchronization, for example, using
phc2sys, while a simple application is sending packets as fast as
possible. First, without .getcrosststamp():

phc2sys[191.382]: enp4s0 sys offset -959 s2 freq -454 delay 4492
phc2sys[191.482]: enp4s0 sys offset 798 s2 freq +1015 delay 4069
phc2sys[191.583]: enp4s0 sys offset 962 s2 freq +1418 delay 3849
phc2sys[191.683]: enp4s0 sys offset 924 s2 freq +1669 delay 3753
phc2sys[191.783]: enp4s0 sys offset 664 s2 freq +1686 delay 3349
phc2sys[191.883]: enp4s0 sys offset 218 s2 freq +1439 delay 2585
phc2sys[191.983]: enp4s0 sys offset 761 s2 freq +2048 delay 3750
phc2sys[192.083]: enp4s0 sys offset 756 s2 freq +2271 delay 4061
phc2sys[192.183]: enp4s0 sys offset 809 s2 freq +2551 delay 4384
phc2sys[192.283]: enp4s0 sys offset -108 s2 freq +1877 delay 2480
phc2sys[192.383]: enp4s0 sys offset -1145 s2 freq +807 delay 4438
phc2sys[192.484]: enp4s0 sys offset 571 s2 freq +2180 delay 3849
phc2sys[192.584]: enp4s0 sys offset 241 s2 freq +2021 delay 3389
phc2sys[192.684]: enp4s0 sys offset 405 s2 freq +2257 delay 3829
phc2sys[192.784]: enp4s0 sys offset 17 s2 freq +1991 delay 3273
phc2sys[192.884]: enp4s0 sys offset 152 s2 freq +2131 delay 3948
phc2sys[192.984]: enp4s0 sys offset -187 s2 freq +1837 delay 3162
phc2sys[193.084]: enp4s0 sys offset -1595 s2 freq +373 delay 4557
phc2sys[193.184]: enp4s0 sys offset 107 s2 freq +1597 delay 3740
phc2sys[193.284]: enp4s0 sys offset 199 s2 freq +1721 delay 4010
phc2sys[193.385]: enp4s0 sys offset -169 s2 freq +1413 delay 3701
phc2sys[193.485]: enp4s0 sys offset -47 s2 freq +1484 delay 3581
phc2sys[193.585]: enp4s0 sys offset -65 s2 freq +1452 delay 3778
phc2sys[193.685]: enp4s0 sys offset 95 s2 freq +1592 delay 3888
phc2sys[193.785]: enp4s0 sys offset 206 s2 freq +1732 delay 4445
phc2sys[193.885]: enp4s0 sys offset -652 s2 freq +936 delay 2521
phc2sys[193.985]: enp4s0 sys offset -203 s2 freq +1189 delay 3391
phc2sys[194.085]: enp4s0 sys offset -376 s2 freq +955 delay 2951
phc2sys[194.185]: enp4s0 sys offset -134 s2 freq +1084 delay 3330
phc2sys[194.285]: enp4s0 sys offset -22 s2 freq +1156 delay 3479
phc2sys[194.386]: enp4s0 sys offset 32 s2 freq +1204 delay 3602
phc2sys[194.486]: enp4s0 sys offset 122 s2 freq +1303 delay 3731

Statistics for this run (total of 2179 lines), in nanoseconds:
average: -1.12
stdev: 634.80
max: 1551
min: -2215

With .getcrosststamp() via PCIe PTM:

phc2sys[367.859]: enp4s0 sys offset 6 s2 freq +1727 delay 0
phc2sys[367.959]: enp4s0 sys offset -2 s2 freq +1721 delay 0
phc2sys[368.059]: enp4s0 sys offset 5 s2 freq +1727 delay 0
phc2sys[368.160]: enp4s0 sys offset -1 s2 freq +1723 delay 0
phc2sys[368.260]: enp4s0 sys offset -4 s2 freq +1719 delay 0
phc2sys[368.360]: enp4s0 sys offset -5 s2 freq +1717 delay 0
phc2sys[368.460]: enp4s0 sys offset 1 s2 freq +1722 delay 0
phc2sys[368.560]: enp4s0 sys offset -3 s2 freq +1718 delay 0
phc2sys[368.660]: enp4s0 sys offset 5 s2 freq +1725 delay 0
phc2sys[368.760]: enp4s0 sys offset -1 s2 freq +1721 delay 0
phc2sys[368.860]: enp4s0 sys offset 0 s2 freq +1721 delay 0
phc2sys[368.960]: enp4s0 sys offset 0 s2 freq +1721 delay 0
phc2sys[369.061]: enp4s0 sys offset 4 s2 freq +1725 delay 0
phc2sys[369.161]: enp4s0 sys offset 1 s2 freq +1724 delay 0
phc2sys[369.261]: enp4s0 sys offset 4 s2 freq +1727 delay 0
phc2sys[369.361]: enp4s0 sys offset 8 s2 freq +1732 delay 0
phc2sys[369.461]: enp4s0 sys offset 7 s2 freq +1733 delay 0
phc2sys[369.561]: enp4s0 sys offset 4 s2 freq +1733 delay 0
phc2sys[369.661]: enp4s0 sys offset 1 s2 freq +1731 delay 0
phc2sys[369.761]: enp4s0 sys offset 1 s2 freq +1731 delay 0
phc2sys[369.861]: enp4s0 sys offset -5 s2 freq +1725 delay 0
phc2sys[369.961]: enp4s0 sys offset -4 s2 freq +1725 delay 0
phc2sys[370.062]: enp4s0 sys offset 2 s2 freq +1730 delay 0
phc2sys[370.162]: enp4s0 sys offset -7 s2 freq +1721 delay 0
phc2sys[370.262]: enp4s0 sys offset -3 s2 freq +1723 delay 0
phc2sys[370.362]: enp4s0 sys offset 1 s2 freq +1726 delay 0
phc2sys[370.462]: enp4s0 sys offset -3 s2 freq +1723 delay 0
phc2sys[370.562]: enp4s0 sys offset -1 s2 freq +1724 delay 0
phc2sys[370.662]: enp4s0 sys offset -4 s2 freq +1720 delay 0
phc2sys[370.762]: enp4s0 sys offset -7 s2 freq +1716 delay 0
phc2sys[370.862]: enp4s0 sys offset -2 s2 freq +1719 delay 0

Statistics for this run (total of 2179 lines), in nanoseconds:
average: 0.14
stdev: 5.03
max: 48
min: -27

For reference, the statistics for runs without PCIe congestion show
that the improvements from enabling PTM are less dramatic. For two
runs of 16466 entries:
without PTM: avg -0.04 stdev 10.57 max 39 min -42
with PTM: avg 0.01 stdev 4.20 max 19 min -16

One possible explanation is that when PTM is not enabled, and there's a lot
of traffic in the PCIe fabric, some register reads will take more time
than the others because of congestion on the PCIe fabric.

When PTM is enabled, even if the PTM dialogs take more time to
complete under heavy traffic, the time measurements do not depend on
the time to read the registers.

This was implemented following the i225 EAS version 0.993.
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

a90ec848

igc: Enable PCIe PTM · 1b5d73fb

Vinicius Costa Gomes authored Jul 26, 2021

Enables PCIe PTM (Precision Time Measurement) support in the igc
driver. Notifies the PCI devices that PCIe PTM should be enabled.

PCIe PTM is similar protocol to PTP (Precision Time Protocol) running
in the PCIe fabric, it allows devices to report time measurements from
their internal clocks and the correlation with the PCIe root clock.

The i225 NIC exposes some registers that expose those time
measurements, those registers will be used, in later patches, to
implement the PTP_SYS_OFFSET_PRECISE ioctl().
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

1b5d73fb

PCI: Add pcie_ptm_enabled() · 014408cd

Vinicius Costa Gomes authored Jul 26, 2021

Add a predicate that returns if PCIe PTM (Precision Time Measurement)
is enabled.

It will only return true if it's enabled in all the ports in the path
from the device to the root.
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

014408cd

Revert "PCI: Make pci_enable_ptm() private" · 1d71eb53

Vinicius Costa Gomes authored Jul 26, 2021

Make pci_enable_ptm() accessible from the drivers.

Exposing this to the driver enables the driver to use the
'ptm_enabled' field of 'pci_dev' to check if PTM is enabled or not.

This reverts commit ac6c26da ("PCI: Make pci_enable_ptm() private").
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

1d71eb53

Merge branch 'ethtool-extend-coalesce-uapi' · 3a62c333

Jakub Kicinski authored Aug 24, 2021

Yufeng Mo says:

====================
ethtool: extend coalesce uAPI

In order to support some configuration in coalesce uAPI, this series
extend coalesce uAPI and add support for CQE mode.

Below is some test result with HNS3 driver:
1. old ethtool(ioctl) + new kernel:
estuary:/$ ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: on  TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 20
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 20
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

2. ethtool(netlink with cqe mode) + kernel without cqe mode:
estuary:/$ ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: on  TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a

rx-usecs: 20
rx-frames: 0
rx-usecs-irq: n/a
rx-frames-irq: n/a

tx-usecs: 20
tx-frames: 0
tx-usecs-irq: n/a
tx-frames-irq: n/a

rx-usecs-low: n/a
rx-frame-low: n/a
tx-usecs-low: n/a
tx-frame-low: n/a

rx-usecs-high: 0
rx-frame-high: n/a
tx-usecs-high: 0
tx-frame-high: n/a

CQE mode RX: n/a  TX: n/a

3. ethool(netlink with cqe mode) + kernel with cqe mode:
estuary:/$ ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: on  TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a

rx-usecs: 20
rx-frames: 0
rx-usecs-irq: n/a
rx-frames-irq: n/a

tx-usecs: 20
tx-frames: 0
tx-usecs-irq: n/a
tx-frames-irq: n/a

rx-usecs-low: n/a
rx-frame-low: n/a
tx-usecs-low: n/a
tx-frame-low: n/a

rx-usecs-high: 0
rx-frame-high: n/a
tx-usecs-high: 0
tx-frame-high: n/a

CQE mode RX: off  TX: off

4. ethool(netlink without cqe mode) + kernel with cqe mode:
estuary:/$ ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: on  TX: on
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a

rx-usecs: 20
rx-frames: 0
rx-usecs-irq: n/a
rx-frames-irq: n/a

tx-usecs: 20
tx-frames: 0
tx-usecs-irq: n/a
tx-frames-irq: n/a

rx-usecs-low: n/a
rx-frame-low: n/a
tx-usecs-low: n/a
tx-frame-low: n/a

rx-usecs-high: 0
rx-frame-high: n/a
tx-usecs-high: 0
tx-frame-high: n/a

Change log:
V2 -> V3:
         fix some warning on W=1 builds in #2

V1 -> V2:
         1. fix compile error using allmodconfig in #2
         2. move some property-related modifications from #2 to #1
            for better review suggested by Jakub Kicinski.

Change log from RFC:
V3 -> V4:
         add document explaining the difference between CQE and EQE
         in #1 suggested by Jakub Kicinski.

V2 -> V3:
         1. split #1 into adding new parameter and adding new attributes.
         2. use NLA_POLICY_MAX(NLA_U8, 1) instead of NLA_U8.
         3. modify the description of CQE in Document.

V1 -> V2:
         refactor #1&#2 in V1 suggestted by Jakub Kicinski.
====================

Link: https://lore.kernel.org/r/1629444920-25437-1-git-send-email-moyufeng@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3a62c333

net: hns3: add ethtool support for CQE/EQE mode configuration · cce1689e

Yufeng Mo authored Aug 20, 2021

Add support in ethtool for switching EQE/CQE mode.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cce1689e

net: hns3: add support for EQE/CQE mode configuration · 9f0c6f4b

Yufeng Mo authored Aug 20, 2021

For device whose version is above V3(include V3), the GL can
select EQE or CQE mode, so adds support for it.

In CQE mode, the coalesced timer will restart when the first new
completion occurs, while in EQE mode, the timer will not restart.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9f0c6f4b

ethtool: extend coalesce setting uAPI with CQE mode · f3ccfda1

Yufeng Mo authored Aug 20, 2021

In order to support more coalesce parameters through netlink,
add two new parameter kernel_coal and extack for .set_coalesce
and .get_coalesce, then some extra info can return to user with
the netlink API.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f3ccfda1