Commits · 49ecc587dca2754571791bebd36e9e36e2a0d973 · Kirill Smelkov / linux

04 Feb, 2021 23 commits

Revert "GTP: add support for flow based tunneling API" · 49ecc587

Jonas Bonn authored Feb 03, 2021

This reverts commit 9ab7e76a.

This patch was committed without maintainer approval and despite a number
of unaddressed concerns from review.  There are several issues that
impede the acceptance of this patch and that make a reversion of this
particular instance of these changes the best way forward:

i)  the patch contains several logically separate changes that would be
better served as smaller patches (for review purposes)
ii) functionality like the handling of end markers has been introduced
without further explanation
iii) symmetry between the handling of GTPv0 and GTPv1 has been
unnecessarily broken
iv) the patchset produces 'broken' packets when extension headers are
included
v) there are no available userspace tools to allow for testing this
functionality
vi) there is an unaddressed Coverity report against the patch concering
memory leakage
vii) most importantly, the patch contains a large amount of superfluous
churn that impedes other ongoing work with this driver

This patch will be reworked into a series that aligns with other
ongoing work and facilitates review.
Signed-off-by: Jonas Bonn <jonas@norrbonn.se>
Acked-by: Harald Welte <laforge@gnumonks.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

49ecc587

net: tracepoint: exposing sk_family in all tcp:tracepoints · 3dd344ea

Hariharan Ananthakrishnan authored Jan 29, 2021

Similar to sock:inet_sock_set_state tracepoint, expose sk_family to
distinguish AF_INET and AF_INET6 families.

The following tcp tracepoints are updated:
tcp:tcp_destroy_sock
tcp:tcp_rcv_space_adjust
tcp:tcp_retransmit_skb
tcp:tcp_send_reset
tcp:tcp_receive_reset
tcp:tcp_retransmit_synack
tcp:tcp_probe
Signed-off-by: Hariharan Ananthakrishnan <hari@netflix.com>
Signed-off-by: Brendan Gregg <bgregg@netflix.com>
Link: https://lore.kernel.org/r/20210129001210.344438-1-hari@netflix.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3dd344ea

tcp: use a smaller percpu_counter batch size for sk_alloc · f5a5589c

Wei Wang authored Feb 02, 2021

Currently, a percpu_counter with the default batch size (2*nr_cpus) is
used to record the total # of active sockets per protocol. This means
sk_sockets_allocated_read_positive() could be off by +/-2*(nr_cpus^2).
This under/over-estimation could lead to wrong memory suppression
conditions in __sk_raise_mem_allocated().
Fix this by using a more reasonable fixed batch size of 16.

See related commit cf86a086 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting") that addresses a similar issue.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20210202193408.1171634-1-weiwan@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f5a5589c

Merge branch 'support-setting-lanes-via-ethtool' · 6fd5eeee

Jakub Kicinski authored Feb 03, 2021

Danielle Ratson says:

====================
Support setting lanes via ethtool

Some speeds can be achieved with different number of lanes. For example,
100Gbps can be achieved using two lanes of 50Gbps or four lanes of
25Gbps. This patchset adds a new selector that allows ethtool to
advertise link modes according to their number of lanes and also force a
specific number of lanes when autonegotiation is off.

Advertising all link modes with a speed of 100Gbps that use two lanes:

$ ethtool -s swp1 speed 100000 lanes 2 autoneg on

Forcing a speed of 100Gbps using four lanes:

$ ethtool -s swp1 speed 100000 lanes 4 autoneg off
====================

Link: https://lore.kernel.org/r/20210202180612.325099-1-danieller@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6fd5eeee

net: selftests: Add lanes setting test · f72e2f48

Danielle Ratson authored Feb 02, 2021

Test that setting lanes parameter is working.

Set max speed and max lanes in the list of advertised link modes,
and then try to set max speed with the lanes below max lanes if exists
in the list.

And then, test that setting number of lanes larger than max lanes fails.

Do the above for both autoneg on and off.

$ ./ethtool_lanes.sh

TEST: 4 lanes is autonegotiated                                     [ OK ]
TEST: Lanes number larger than max width is not set                 [ OK ]
TEST: Autoneg off, 4 lanes detected during force mode               [ OK ]
TEST: Lanes number larger than max width is not set                 [ OK ]
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f72e2f48

mlxsw: ethtool: Pass link mode in use to ethtool · 25a96f05

Danielle Ratson authored Feb 02, 2021

Currently, when user space queries the link's parameters, as speed and
duplex, each parameter is passed from the driver to ethtool.

Instead, pass the link mode bit in use.
In Spectrum-1, simply pass the bit that is set to '1' from PTYS register.
In Spectrum-2, pass the first link mode bit in the mask of the used
link mode.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

25a96f05

mlxsw: ethtool: Add support for setting lanes when autoneg is off · 763ece86

Danielle Ratson authored Feb 02, 2021

Currently, when auto negotiation is set to off, the user can force a
specific speed or both speed and duplex. The user cannot influence the
number of lanes that will be forced.

Add support for setting speed along with lanes so one would be able
to choose how many lanes will be forced.

When lanes parameter is passed from user space, choose the link mode
that its actual width equals to it.
Otherwise, the default link mode will be the one that supports the width
of the port.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

763ece86

mlxsw: ethtool: Remove max lanes filtering · 5fc4053d

Danielle Ratson authored Feb 02, 2021

Currently, when a speed can be supported by different number of lanes,
the supported link modes bitmask contains only link modes with a single
number of lanes.

This was done in order to prevent auto negotiation on number of
lanes after 50G-1-lane and 100G-2-lanes link modes were introduced.

For example, if a port's max width is 4, only link modes with 4 lanes
will be presented as supported by that port, so 100G is always achieved by
4 lanes of 25G.

After the previous patches that allow selection of the number of lanes,
auto negotiation on number of lanes becomes practical.

Remove that filtering of the maximum number of lanes supported link modes,
so indeed all the supported and advertised link modes will be shown.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

5fc4053d

ethtool: Expose the number of lanes in use · 7dc33f09

Danielle Ratson authored Feb 02, 2021

Currently, ethtool does not expose how many lanes are used when the
link is up.

After adding a possibility to advertise or force a specific number of
lanes, the lanes in use value can be either the maximum width of the port
or below.

Extend ethtool to expose the number of lanes currently in use for
drivers that support it.

For example:

$ ethtool -s swp1 speed 100000 lanes 4
$ ethtool -s swp2 speed 100000 lanes 4
$ ip link set swp1 up
$ ip link set swp2 up
$ ethtool swp1
Settings for swp1:
        Supported ports: [ FIBRE         Backplane ]
        Supported link modes:   1000baseT/Full
                                10000baseT/Full
                                1000baseKX/Full
                                10000baseKR/Full
                                10000baseR_FEC
                                40000baseKR4/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                50000baseKR2/Full
                                100000baseKR4/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
                                50000baseSR2/Full
                                10000baseCR/Full
                                10000baseSR/Full
                                10000baseLR/Full
                                10000baseER/Full
                                50000baseKR/Full
                                50000baseSR/Full
                                50000baseCR/Full
                                50000baseLR_ER_FR/Full
                                50000baseDR/Full
                                100000baseKR2/Full
                                100000baseSR2/Full
                                100000baseCR2/Full
                                100000baseLR2_ER2_FR2/Full
                                100000baseDR2/Full
                                200000baseKR4/Full
                                200000baseSR4/Full
                                200000baseLR4_ER4_FR4/Full
                                200000baseDR4/Full
                                200000baseCR4/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  1000baseT/Full
                                10000baseT/Full
                                1000baseKX/Full
                                1000baseKX/Full
                                10000baseKR/Full
                                10000baseR_FEC
                                40000baseKR4/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                50000baseKR2/Full
                                100000baseKR4/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
                                50000baseSR2/Full
                                10000baseCR/Full
                                10000baseSR/Full
                                10000baseLR/Full
                                10000baseER/Full
                                200000baseKR4/Full
                                200000baseSR4/Full
                                200000baseLR4_ER4_FR4/Full
                                200000baseDR4/Full
                                200000baseCR4/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Advertised link modes:  100000baseKR4/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
	Advertised pause frame use: No
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Speed: 100000Mb/s
	Lanes: 4
	Duplex: Full
	Auto-negotiation: on
	Port: Direct Attach Copper
	PHYAD: 0
	Transceiver: internal
	Link detected: yes
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7dc33f09

ethtool: Get link mode in use instead of speed and duplex parameters · c8907043

Danielle Ratson authored Feb 02, 2021

Currently, when user space queries the link's parameters, as speed and
duplex, each parameter is passed from the driver to ethtool.

Instead, get the link mode bit in use, and derive each of the parameters
from it in ethtool.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c8907043

ethtool: Extend link modes settings uAPI with lanes · 012ce4dd

Danielle Ratson authored Feb 02, 2021

Currently, when auto negotiation is on, the user can advertise all the
linkmodes which correspond to a specific speed, but does not have a
similar selector for the number of lanes. This is significant when a
specific speed can be achieved using different number of lanes.  For
example, 2x50 or 4x25.

Add 'ETHTOOL_A_LINKMODES_LANES' attribute and expand 'struct
ethtool_link_settings' with lanes field in order to implement a new
lanes-selector that will enable the user to advertise a specific number
of lanes as well.

When auto negotiation is off, lanes parameter can be forced only if the
driver supports it. Add a capability bit in 'struct ethtool_ops' that
allows ethtool know if the driver can handle the lanes parameter when
auto negotiation is off, so if it does not, an error message will be
returned when trying to set lanes.

Example:

$ ethtool -s swp1 lanes 4
$ ethtool swp1
  Settings for swp1:
	Supported ports: [ FIBRE ]
        Supported link modes:   1000baseKX/Full
                                10000baseKR/Full
                                40000baseCR4/Full
				40000baseSR4/Full
				40000baseLR4/Full
                                25000baseCR/Full
                                25000baseSR/Full
				50000baseCR2/Full
                                100000baseSR4/Full
				100000baseCR4/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  40000baseCR4/Full
				40000baseSR4/Full
				40000baseLR4/Full
                                100000baseSR4/Full
				100000baseCR4/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: Unknown!
        Duplex: Unknown! (255)
        Auto-negotiation: on
        Port: Direct Attach Copper
        PHYAD: 0
        Transceiver: internal
        Link detected: no
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

012ce4dd

ethtool: Validate master slave configuration before rtnl_lock() · 189e7a8d

Danielle Ratson authored Feb 02, 2021

Create a new function for input validations to be called before
rtnl_lock() and move the master slave validation to that function.

This would be a cleanup for next patch that would add another validation
to the new function.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

189e7a8d

net: dsa: fix SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING getting ignored · 99b8202b

Vladimir Oltean authored Feb 03, 2021

The bridge emits VLAN filtering events and quite a few others via
switchdev with orig_dev = br->dev. After the blamed commit, these events
started getting ignored.

The point of the patch was to not offload switchdev objects for ports
that didn't go through dsa_port_bridge_join, because the configuration
is unsupported:
- ports that offload a bonding/team interface go through
  dsa_port_bridge_join when that bonding/team interface is later bridged
  with another switch port or LAG
- ports that don't offload LAG don't get notified of the bridge that is
  on top of that LAG.

Sadly, a check is missing, which is that the orig_dev is equal to the
bridge device. This check is compatible with the original intention,
because ports that don't offload bridging because they use a software
LAG don't have dp->bridge_dev set.

On a semi-related note, we should not offload switchdev objects or
populate dp->bridge_dev if the driver doesn't implement .port_bridge_join
either. However there is no regression associated with that, so it can
be done separately.

Fixes: 5696c8ae ("net: dsa: Don't offload port attributes on standalone ports")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
Tested-by: Tobias Waldekranz <tobias@waldekranz.com>
Link: https://lore.kernel.org/r/20210202233109.1591466-1-olteanv@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

99b8202b

Merge branch 'chelsio-cxgb-use-threaded-interrupts-for-deferred-work' · 75b8f78f

Jakub Kicinski authored Feb 03, 2021

Sebastian Andrzej Siewior says:

====================
chelsio: cxgb: Use threaded interrupts for deferred work

Patch #2 fixes an issue in which del_timer_sync() and tasklet_kill() is
invoked from the interrupt handler. This is probably a rare error case
since it disables interrupts / the card in that case.
Patch #1 converts a worker to use a threaded interrupt which is then
also used in patch #2 instead adding another worker for this task (and
flush_work() to synchronise vs rmmod).
====================

Link: https://lore.kernel.org/r/20210202170104.1909200-1-bigeasy@linutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

75b8f78f

chelsio: cxgb: Disable the card on error in threaded interrupt · 82154580

Sebastian Andrzej Siewior authored Feb 02, 2021

t1_fatal_err() is invoked from the interrupt handler. The bad part is
that it invokes (via t1_sge_stop()) del_timer_sync() and tasklet_kill().
Both functions must not be called from an interrupt because it is
possible that it will wait for the completion of the timer/tasklet it
just interrupted.

In case of a fatal error, use t1_interrupts_disable() to disable all
interrupt sources and then wake the interrupt thread with
F_PL_INTR_SGE_ERR as pending flag. The threaded-interrupt will stop the
card via t1_sge_stop() and not re-enable the interrupts again.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

82154580

chelsio: cxgb: Replace the workqueue with threaded interrupt · fec7fa0a

Sebastian Andrzej Siewior authored Feb 02, 2021

The external interrupt (F_PL_INTR_EXT) needs to be handled in a process
context and this is accomplished by utilizing a workqueue.

The process context can also be provided by a threaded interrupt instead
of a workqueue. The threaded interrupt can be used later for other
interrupt related processing which require non-atomic context without
using yet another workqueue. free_irq() also ensures that the thread is
done which is currently missing (the worker could continue after the
module has been removed).

Save pending flags in pending_thread_intr. Use the same mechanism
to disable F_PL_INTR_EXT as interrupt source like it is used before the
worker is scheduled. Enable the interrupt again once
t1_elmer0_ext_intr_handler() is done.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fec7fa0a

Merge branch 'support-for-octeontx2-98xx-cpt-block' · 462e99a1

Jakub Kicinski authored Feb 03, 2021

Srujana Challa says:

====================
Support for OcteonTX2 98xx CPT block.

OcteonTX2 series of silicons have multiple variants, the
98xx variant has two crypto (CPT) blocks to double the crypto
performance. This patchset adds support for new CPT block(CPT1).
====================

Link: https://lore.kernel.org/r/20210202152709.20450-1-schalla@marvell.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

462e99a1

octeontx2-af: Handle CPT function level reset · c57c58fd

Srujana Challa authored Feb 02, 2021

When FLR is initiated for a VF (PCI function level reset),
the parent PF gets a interrupt. PF then sends a message to
admin function (AF), which then cleans up all resources
attached to that VF. This patch adds support to handle
CPT FLR.
Signed-off-by: Narayana Prasad Raju Atherya <pathreya@marvell.com>
Signed-off-by: Suheil Chandran <schandran@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c57c58fd

octeontx2-af: Add support for CPT1 in debugfs · b0f60fab

Srujana Challa authored Feb 02, 2021

Adds support to display block CPT1 stats at
"/sys/kernel/debug/octeontx2/cpt1".
Signed-off-by: Mahipal Challa <mchalla@marvell.com>
Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

b0f60fab

octeontx2-af: Mailbox changes for 98xx CPT block · de2854c8

Srujana Challa authored Feb 02, 2021

This patch changes CPT mailbox message format to
support new block CPT1 in 98xx silicon.

cpt_rd_wr_reg ->
    Modify cpt_rd_wr_reg mailbox and its handler to
    accommodate new block CPT1.
cpt_lf_alloc ->
    Modify cpt_lf_alloc mailbox and its handler to
    configure LFs from a block address out of multiple
    blocks of same type. If a PF/VF needs to configure
    LFs from both the blocks then this mbox should be
    called twice.
Signed-off-by: Mahipal Challa <mchalla@marvell.com>
Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

de2854c8

net: mdiobus: Prevent spike on MDIO bus reset signal · e0183b97

Mike Looijmans authored Feb 02, 2021

The mdio_bus reset code first de-asserted the reset by allocating with
GPIOD_OUT_LOW, then asserted and de-asserted again. In other words, if
the reset signal defaulted to asserted, there'd be a short "spike"
before the reset.

Here is what happens depending on the pre-existing state of the reset
signal:
Reset (previously asserted):   ~~~|_|~~~~|_______
Reset (previously deasserted): _____|~~~~|_______
                                  ^ ^    ^
                                  A B    C

At point A, the low going transition is because the reset line is
requested using GPIOD_OUT_LOW. If the line is successfully requested,
the first thing we do is set it high _without_ any delay. This is
point B. So, a glitch occurs between A and B.

We then fsleep() and finally set the GPIO low at point C.

Requesting the line using GPIOD_OUT_HIGH eliminates the A and B
transitions. Instead we get:

Reset (previously asserted)  : ~~~~~~~~~~|______
Reset (previously deasserted): ____|~~~~~|______
                                   ^     ^
                                   A     C

Where A and C are the points described above in the code. Point B
has been eliminated.

The issue was found when we pulled down the reset signal for the
Marvell 88E1512P PHY (because it requires at least 50ms after POR with
an active clock). Looking at the reset signal with a scope revealed a
short spike, point B in the artwork above.
Signed-off-by: Mike Looijmans <mike.looijmans@topic.nl>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20210202143239.10714-1-mike.looijmans@topic.nlSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e0183b97

net: mscc: ocelot: fix error code in mscc_ocelot_probe() · 4160d9ec

Dan Carpenter authored Feb 02, 2021

Probe should return an error code if platform_get_irq_byname() fails
but it returns success instead.

Fixes: 6c30384e ("net: mscc: ocelot: register devlink ports")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/YBkXyFIl4V9hgxYM@mwandaSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4160d9ec

net: mscc: ocelot: fix error handling bugs in mscc_ocelot_init_ports() · e0c16233

Dan Carpenter authored Feb 02, 2021

There are several error handling bugs in mscc_ocelot_init_ports().  I
went through the code, and carefully audited it and made fixes and
cleanups.

1) The ocelot_probe_port() function didn't have a mirror release function
   so it was hard to follow.  I created the ocelot_release_port()
   function.
2) In the ocelot_probe_port() function, if the register_netdev() call
   failed, then it lead to a double free_netdev(dev) bug.  Fix this by
   setting "ocelot->ports[port] = NULL" on the error path.
3) I was concerned that the "port" which comes from of_property_read_u32()
   might be out of bounds so I added a check for that.
4) In the original code if ocelot_regmap_init() failed then the driver
   tried to continue but I think that should be a fatal error.
5) If ocelot_probe_port() failed then the most recent devlink was leaked.
   The fix for mostly came Vladimir Oltean.  Get rid of "registered_ports"
   and just set a bit in "devlink_ports_registered" to say when the
   devlink port has been registered (and needs to be unregistered on
   error).  There are fewer than 32 ports so a u32 is large enough for
   this purpose.
6) The error handling if the final ocelot_port_devlink_init() failed had
   two problems.  The "while (port-- >= 0)" loop should have been
   "--port" pre-op instead of a post-op to avoid a buffer underflow.
   The "if (!registered_ports[port])" condition was reversed leading to
   resource leaks and double frees.

Fixes: 6c30384e ("net: mscc: ocelot: register devlink ports")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/YBkXhqRxHtRGzSnJ@mwandaSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e0c16233

03 Feb, 2021 17 commits

Merge branch 'net-use-indirect_call-in-some-dst_ops' · 2d912da0

Jakub Kicinski authored Feb 03, 2021

Brian Vazquez says:

====================
net: use INDIRECT_CALL in some dst_ops

This patch series uses the INDIRECT_CALL wrappers in some dst_ops
functions to mitigate retpoline costs. Benefits depend on the
platform as described below.

Background: The kernel rewrites the retpoline code at
__x86_indirect_thunk_r11 depending on the CPU's requirements.
The INDIRECT_CALL wrappers provide hints on possible targets and
save the retpoline overhead using a direct call in case the
target matches one of the hints.

The retpoline overhead for the following three cases has been
measured by Luigi Rizzo in microbenchmarks, using CPU performance
counters, and cover reasonably well the range of possible retpoline
overheads compared to a plain indirect call (in equal conditions,
specifically with predicted branch, hot cache):

- just "jmp *(%r11)" on modern platforms like Intel Cascadelake.
  In this case the overhead is just 2 clock cycles:

- "lfence; jmp *(%r11)" on e.g. some recent AMD CPUs.
  In this case the lfence is blocked until pending reads complete,
  so the actual overhead depends on previous instructions.
  The best case we have measured 15 clock cycles of overhead.

- worst case, e.g. skylake, the full retpoline is used

    __x86_indirect_thunk_r11:     call set_u_target
    capture_speculation:          pause
                                  lfence
                                  jmp capture_speculation
    .align 16
    set_up_target:                mov %r11, (%rsp)
                                  ret

   In this case the overhead has been measured in 35-40 clock cycles.

The actual time saved hence depends on the platform and current
clock speed (which varies heavily, especially when C-states are active).
Also note that actual benefit might be lower than expected if the
longer retpoline overlaps with some pending memory read.

MEASUREMENTS:
The INDIRECT_CALL wrappers in this patchset involve the processing
of incoming SYN and generation of syncookies. Hence, the test has been
run by configuring a receiving host with a single NIC rx queue, disabling
RPS and RFS so that all processing occurs on the same core.
An external source generates SYN fast enough to saturate the receiving CPU.
We ran two sets of experiments, with and without the dst_output patch,
comparing the number of syncookies generated over a 20s period
in multiple runs.

Assuming the CPU is saturated, the time per packet is
   t = number_of_packets/total_time
and if the two datasets have statistically meaningful difference,
the difference in times between the two cases gives an estimate
of the benefits from one INDIRECT_CALL.

Here are the experimental results:

Skylake     Syncookies over 20s (5 tests)
---------------------------------------------------
indirect    9166325 9182023 9170093 9134014 9171082
retpoline   9099308 9126350 9154841 9056377 9122376

Computing the stats on the ns_pkt = 20e6/total_packets gives the following:

$ ministat -c 95 -w 70 /tmp/sk-indirect /tmp/sk-retp
x /tmp/sk-indirect
+ /tmp/sk-retp
+----------------------------------------------------------------------+
|x     xx x     +          x    + +           +                       +|
||______M__A_______|_|____________M_____A___________________|          |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5   2.17817e-06   2.18962e-06     2.181e-06  2.182292e-06 4.3252133e-09
+   5   2.18464e-06   2.20839e-06   2.19241e-06  2.194974e-06 8.8695958e-09
Difference at 95.0% confidence
        1.2682e-08 +/- 1.01766e-08
        0.581132% +/- 0.466326%
        (Student's t, pooled s = 6.97772e-09)

This suggests a difference of 13ns +/- 10ns
Our expectation from microbenchmarks was 35-40 cycles per call,
but part of the gains may be eaten by stalls from pending memory reads.

For Cascadelake:
Cascadelake     Syncookies over 20s (5 tests)
---------------------------------------------------------
indirect     10339797 10297547 10366826 10378891 10384854
retpoline    10332674 10366805 10320374 10334272 10374087

Computing the stats on the ns_pkt = 20e6/total_packets gives no
meaningful difference even at just 80% (this was expected):

$ ministat -c 80 -w 70 /tmp/cl-indirect /tmp/cl-retp
x /tmp/cl-indirect
+ /tmp/cl-retp
+----------------------------------------------------------------------+
|   x    x  +     *                   x   + +        +                x|
||______________|_M_________A_____A_______M________|___|               |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5   1.92588e-06   1.94221e-06   1.92923e-06  1.931716e-06 6.6936746e-09
+   5   1.92788e-06   1.93791e-06   1.93531e-06  1.933188e-06 4.3734106e-09
No difference proven at 80.0% confidence
====================

Link: https://lore.kernel.org/r/20210201174132.3534118-1-brianvv@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2d912da0

net: indirect call helpers for ipv4/ipv6 dst_check functions · bbd807df

Brian Vazquez authored Feb 01, 2021

This patch avoids the indirect call for the common case:
ip6_dst_check and ipv4_dst_check
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bbd807df

net: use indirect call helpers for dst_mtu · f67fbeae

Brian Vazquez authored Feb 01, 2021

This patch avoids the indirect call for the common case:
ip6_mtu and ipv4_mtu
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f67fbeae

net: use indirect call helpers for dst_output · 6585d7dc

Brian Vazquez authored Feb 01, 2021

This patch avoids the indirect call for the common case:
ip6_output and ip_output
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6585d7dc

net: use indirect call helpers for dst_input · e43b2190

Brian Vazquez authored Feb 01, 2021

This patch avoids the indirect call for the common case:
ip_local_deliver and ip6_input
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e43b2190

net: usb: cdc_ncm: use new API for bh tasklet · 4f4e5436

Emil Renner Berthing authored Jan 31, 2021

This converts the driver to use the new tasklet API introduced in
commit 12cc923f ("tasklet: Introduce new initialization API")

It is unfortunate that we need to add a pointer to the driver context to
get back to the usbnet device, but the space will be reclaimed once
there are no more users of the old API left and we can remove the data
value and flag from the tasklet struct.
Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Link: https://lore.kernel.org/r/20210130234637.26505-1-kernel@esmil.dkSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4f4e5436

net: fec: Silence M5272 build warnings · 32d1bbb1

Geert Uytterhoeven authored Feb 02, 2021

If CONFIG_M5272=y:

    drivers/net/ethernet/freescale/fec_main.c: In function ‘fec_restart’:
    drivers/net/ethernet/freescale/fec_main.c:948:6: warning: unused variable ‘val’ [-Wunused-variable]
      948 |  u32 val;
	  |      ^~~
    drivers/net/ethernet/freescale/fec_main.c: In function ‘fec_get_mac’:
    drivers/net/ethernet/freescale/fec_main.c:1667:28: warning: unused variable ‘pdata’ [-Wunused-variable]
     1667 |  struct fec_platform_data *pdata = dev_get_platdata(&fep->pdev->dev);
	  |                            ^~~~~

Fix this by moving the variable declarations inside the existing #ifdef
blocks.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20210202130650.865023-1-geert@linux-m68k.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

32d1bbb1

inet: do not export inet_gro_{receive|complete} · fca23f37

Eric Dumazet authored Feb 02, 2021

inet_gro_receive() and inet_gro_complete() are part
of GRO engine which can not be modular.

Similarly, inet_gso_segment() does not need to be exported,
being part of GSO stack.

In other words, net/ipv6/ip6_offload.o is part of vmlinux,
regardless of CONFIG_IPV6.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20210202154145.1568451-1-eric.dumazet@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

fca23f37

Merge tag 'mac80211-next-for-net-next-2021-02-02' of... · 0256317a

Jakub Kicinski authored Feb 02, 2021

Merge tag 'mac80211-next-for-net-next-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
This time, only RTNL locking reduction fallout.
 - cfg80211_dev_rename() requires RTNL
 - cfg80211_change_iface() and cfg80211_set_encryption()
   require wiphy mutex (was missing in wireless extensions)
 - cfg80211_destroy_ifaces() requires wiphy mutex
 - netdev registration can fail due to notifiers, and then
   notifiers are "unrolled", need to handle this properly

* tag 'mac80211-next-for-net-next-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next:
  cfg80211: fix netdev registration deadlock
  cfg80211: call cfg80211_destroy_ifaces() with wiphy lock held
  wext: call cfg80211_set_encryption() with wiphy lock held
  wext: call cfg80211_change_iface() with wiphy lock held
  nl80211: call cfg80211_dev_rename() under RTNL
====================

Link: https://lore.kernel.org/r/20210202144106.38207-1-johannes@sipsolutions.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>

0256317a

Merge tag 'mlx5-updates-2021-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 390d9b56

Jakub Kicinski authored Feb 02, 2021

Saeed Mahameed says:

====================
mlx5-updates-2021-02-01

mlx5 netdev updates:

1) Trivial refactoring ahead of the upcoming uplink representor series.
2) Increased RSS table size to 256, for better results
3) Misc. Cleanup and very trivial improvements

* tag 'mlx5-updates-2021-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: DR, Avoid unnecessary csum recalculation on supporting devices
  net/mlx5e: CT: remove useless conversion to PTR_ERR then ERR_PTR
  net/mlx5e: accel, remove redundant space
  net/mlx5e: kTLS, Improve TLS RX workqueue scope
  net/mlx5e: remove h from printk format specifier
  net/mlx5e: Increase indirection RQ table size to 256
  net/mlx5e: Enable napi in channel's activation stage
  net/mlx5e: Move representor neigh init into profile enable
  net/mlx5e: Avoid false lock depenency warning on tc_ht
  net/mlx5e: Move set vxlan nic info to profile init
  net/mlx5e: Move netif_carrier_off() out of mlx5e_priv_init()
  net/mlx5e: Refactor mlx5e_netdev_init/cleanup to mlx5e_priv_init/cleanup
  net/mxl5e: Add change profile method
  net/mlx5e: Separate between netdev objects and mlx5e profiles initialization
====================

Link: https://lore.kernel.org/r/20210202065457.613312-1-saeed@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

390d9b56

Merge branch 'mptcp-add_addr-enhancements' · a1a809c4

Jakub Kicinski authored Feb 02, 2021

Mat Martineau says:

====================
mptcp: ADD_ADDR enhancements

This patch series from the MPTCP tree contains enhancements and
associated tests for the ADD_ADDR ("add address") MPTCP option. This
option allows already-connected MPTCP peers to share additional IP
addresses with each other, which can then be used to create additional
subflows within those MPTCP connections.

Patches 1 & 2 remove duplicated data in the per-connection path manager
structure.

Patches 3-6 initiate additional subflows when an address is added using
the netlink path manager interface and improve ADD_ADDR signaling
reliability, subject to configured limits. Self tests are also updated.

Patches 7-15 add new support for optional port numbers in ADD_ADDR. This
includes creating an additional in-kernel TCP listening socket for the
requested port number, validating the port number when processing
incoming subflow connections, including the port number in netlink
interfaces, and adding some new MIBs. New self test cases are added for
subflows connecting with alternate port numbers.
====================

Link: https://lore.kernel.org/r/20210201230920.66027-1-mathew.j.martineau@linux.intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a1a809c4

selftests: mptcp: add testcases for ADD_ADDR with port · 8a127bf6

Geliang Tang authored Feb 01, 2021

This patch adds testcases for ADD_ADDR with port and the related MIB
counters check in chk_add_nr. The output looks like this:

24 signal address with port syn[ ok ] - synack[ ok ] - ack[ ok ]
add[ ok ] - echo [ ok ] - pt [ ok ]
syn[ ok ] - synack[ ok ] - ack[ ok ]
syn[ ok ] - ack [ ok ]
25 subflow and signal with port syn[ ok ] - synack[ ok ] - ack[ ok ]
add[ ok ] - echo [ ok ] - pt [ ok ]
syn[ ok ] - synack[ ok ] - ack[ ok ]
syn[ ok ] - ack [ ok ]
26 remove single address with port syn[ ok ] - synack[ ok ] - ack[ ok ]
add[ ok ] - echo [ ok ] - pt [ ok ]
syn[ ok ] - synack[ ok ] - ack[ ok ]
syn[ ok ] - ack [ ok ]
rm [ ok ] - sf [ ok ]
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8a127bf6

mptcp: add the mibs for ADD_ADDR with port · 2fbdd9ea

Geliang Tang authored Feb 01, 2021

This patch adds the mibs for ADD_ADDR with port:

MPTCP_MIB_PORTADD for received ADD_ADDR suboption with a port number.

MPTCP_MIB_PORTSYNRX, MPTCP_MIB_PORTSYNACKRX, MPTCP_MIB_PORTACKRX, for
received MP_JOIN's SYN or SYN/ACK or ACK with a port number which is
different from the msk's port number.

MPTCP_MIB_MISMATCHPORTSYNRX and MPTCP_MIB_MISMATCHPORTACKRX, for
received SYN or ACK MP_JOIN with a mismatched port-number.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2fbdd9ea

selftests: mptcp: add port argument for pm_nl_ctl · d4a7726a

Geliang Tang authored Feb 01, 2021

This patch adds a new argument for pm_nl_ctl tool. We can use it like
this:

 # pm_nl_ctl add 10.0.2.1 flags signal port 10100
 # pm_nl_ctl dump
 id 1 flags signal 10.0.2.1 10100
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d4a7726a

mptcp: deal with MPTCP_PM_ADDR_ATTR_PORT in PM netlink · a77e9179

Geliang Tang authored Feb 01, 2021

This patch adds MPTCP_PM_ADDR_ATTR_PORT filling and parsing in PM
netlink.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a77e9179

mptcp: enable use_port when invoke addresses_equal · 60b57bf7

Geliang Tang authored Feb 01, 2021

When dealing with the addresses list local_addr_list or anno_list, we
should enable the function addresses_equal's parameter use_port. And
enable it in address_zero too.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

60b57bf7

mptcp: add port number check for MP_JOIN · 5bc56388

Geliang Tang authored Feb 01, 2021

This patch adds two new helpers, subflow_use_different_sport and
subflow_use_different_dport, to check whether the subflow's source or
destination port number is different from the msk's port number. When
receiving the MP_JOIN's SYN/SYNACK/ACK, we do these port number checks
and print out the different port numbers.

And furthermore, when receiving the MP_JOIN's SYN/ACK, we also use a new
helper mptcp_pm_sport_in_anno_list to check whether this port number is
announced. If it isn't, we need to abort this connection.

This patch also populates the local address's port field in
local_address.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

5bc56388