Commits · c73e5807e4f6fc6d373a5db55b45f639f8bb6328 · nexedi / linux

11 Nov, 2018 40 commits

tcp: tsq: no longer use limit_output_bytes for paced flows · c73e5807

Eric Dumazet authored Nov 11, 2018

FQ pacing guarantees that paced packets queued by one flow do not
add head-of-line blocking for other flows.

After TCP GSO conversion, increasing limit_output_bytes to 1 MB is safe,
since this maps to 16 skbs at most in qdisc or device queues.
(or slightly more if some drivers lower {gso_max_segs|size})

We still can queue at most 1 ms worth of traffic (this can be scaled
by wifi drivers if they need to)

Tested:

# ethtool -c eth0 | egrep "tx-usecs:|tx-frames:" # 40 Gbit mlx4 NIC
tx-usecs: 16
tx-frames: 16
# tc qdisc replace dev eth0 root fq
# for f in {1..10};do netperf -P0 -H lpaa24,6 -o THROUGHPUT;done

Before patch:
27711
26118
27107
27377
27712
27388
27340
27117
27278
27509

After patch:
37434
36949
36658
36998
37711
37291
37605
36659
36544
37349
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c73e5807

Merge branch 'tcp-tso-defer-improvements' · 83afb36a

David S. Miller authored Nov 11, 2018

Eric Dumazet says:

====================
tcp: tso defer improvements

This series makes tcp_tso_should_defer() a bit smarter :

1) MSG_EOR gives a hint to TCP to not defer some skbs

2) Second patch takes into account that head tstamp
   can be in the future.

3) Third patch uses existing high resolution state variables
   to have a more precise heuristic.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

83afb36a

tcp: get rid of tcp_tso_should_defer() dependency on HZ/jiffies · a682850a

Eric Dumazet authored Nov 11, 2018

tcp_tso_should_defer() first heuristic is to not defer
if last send is "old enough".

Its current implementation uses jiffies and its low granularity.

TSO autodefer performance should not rely on kernel HZ :/

After EDT conversion, we have state variables in nanoseconds that
can allow us to properly implement the heuristic.

This patch increases TSO chunk sizes on medium rate flows,
especially when receivers do not use GRO or similar aggregation.

It also reduces bursts for HZ=100 or HZ=250 kernels, making TCP
behavior more uniform.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a682850a

tcp: refine tcp_tso_should_defer() after EDT adoption · f1c6ea38

Eric Dumazet authored Nov 11, 2018

tcp_tso_should_defer() last step tries to check if the probable
next ACK packet is coming in less than half rtt.

Problem is that the head->tstamp might be in the future,
so we need to use signed arithmetics to avoid overflows.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f1c6ea38

tcp: do not try to defer skbs with eor mark (MSG_EOR) · 1c09f7d0

Eric Dumazet authored Nov 11, 2018

Applications using MSG_EOR are giving a strong hint to TCP stack :

Subsequent sendmsg() can not append more bytes to skbs having
the EOR mark.

Do not try to TSO defer suchs skbs, there is really no hope.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c09f7d0

tcp: minor optimization in tcp ack fast path processing · 5e13a0d3

Yafang Shao authored Nov 11, 2018

Bitwise operation is a little faster.
So I replace after() with using the flag FLAG_SND_UNA_ADVANCED as it is
already set before.

In addtion, there's another similar improvement in tcp_cwnd_reduction().

Cc: Joe Perches <joe@perches.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e13a0d3

Merge branch 'mv88e6xxx-Support-more-SERDES-interfacxes' · 0cf3a68a

David S. Miller authored Nov 11, 2018

Andrew Lunn says:

====================
net: dsa: mv88e6xxx: Support more SERDES interfacxes

Currently the SERDES interfaces for ports 9 and 10 on the mv88e6390x
are supported, allowing upto 10G. However, when unused, these SERDES
interfaces can be used by some of the lower ports for 1000Base-X.

The tricky bit here is ordering. The SERDES have to become free from
ports 9 or 10 before they can be used with lower ports. Normally, this
would happen only when these ports would be configured up, which is
too late. So at probe time, defaulting ports 9 and 10 to 1000BaseX
frees them for use with lower ports. If they are actually needed, they
will be taken back when port 9 and 10 goes up.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

0cf3a68a

net: dsa: mv88e6xxx: Add support for SERDES on ports 2-8 for 6390X · 2defda1f

Andrew Lunn authored Nov 11, 2018

The 6390X family has 8 SERDES interfaces. When ports 9 and 10 are not
using all their SERDES interfaces, the unused ones can be assigned to
ports 2-8. Add support for interrupts from SERDES interfaces connected
to these lower ports.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

2defda1f

net: dsa: mv88e6xxx: Default ports 9/10 6390X CMODE to 1000BaseX · 787799a9

Andrew Lunn authored Nov 11, 2018

The 6390X family has 8 SERDES interfaces. This allows ports 9 and 10
to support up to 10Gbps using 4 SERDES interfaces. However, when lower
speeds are used, which need fewer SERDES interfaces, the unused SERDES
interfaces can be used by ports 2-8.

The hardware defaults to ports 9 and 10 having all 4 SERDES interfaces
assigned to them. This only gets changed when the interface is
configured after what the SFP supports has been determined, or the 10G
PHY completes auto-neg.

For hardware designs which limit ports 9 and 10 to one or two SERDES
interfaces, and place SFPs on the lower interfaces, this is too
late. Those ports with SFP should not wait until ports 9/10 are up in
order to get access to the SERDES interface. So change the default
configuration when the driver is initialised. Configure ports 9 and 10
to 1000BaseX, so they use a single SERDES interface, freeing up the
others. They can steal them back if they need them.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

787799a9

net: dsa: mv88e6xxx: Differentiate between 6390 and 6390X cmodes · fdc71eea

Andrew Lunn authored Nov 11, 2018

The X family variants support additional ports modes, for 10G
operation, which the non-X variants don't have. Add a port_set_cmode()
for non-X variants to enforce this.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

fdc71eea

net: dsa: mv88e6xxx: Group cmode ops together · b3dce4da

Andrew Lunn authored Nov 11, 2018

Move .port_set_cmode next to .port_get_cmode.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

b3dce4da

Merge branch 'net-phy-convert-advertise-and-supported-to-linkmode' · 8d2681f5

David S. Miller authored Nov 11, 2018

Andrew Lunn says:

====================
net: phy: convert advertise and supported to linkmode

This is the last part in converting phylib to make use of a linux
bitmap, not a u32, to represent links modes. This will allow support
for PHYs > 1Gbps, which need to use link modes represented by a bit >
32.

A number of MAC and PHY drivers need changes to support this. However
the previous two patchesets reduced the number somewhat, the helpers
which were introduced have been modified instead of the actual
drivers.

The follow on patches then make use of the extra bits, adding support
for more link modes.

Given how invasive this change is, i expect the build is broken for
some architectures i did not test. I will fixup the breakage as fast
as i can.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8d2681f5

net: phy: Add support for resolving 5G and 2.5G autoneg · cb6402fe

Andrew Lunn authored Nov 10, 2018

Now that 2.5G and 5G can be represented in phydev->advertising and
phydev->lp_advertising, add these two links modes as possible
resolutions to auto negotiation.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

cb6402fe

net: phy: Add more link modes to the settings table · 3c6b59d6

Andrew Lunn authored Nov 10, 2018

Now that PHYs and MAC can support more than 32 bit masks, add link
modes which are > 31 to the PHY settings table.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

3c6b59d6

net: phy: Fixup kerneldoc markup. · fe191914

Andrew Lunn authored Nov 10, 2018

Add missing markup for function parameters
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

fe191914

net: phy: Convert u32 phydev->lp_advertising to linkmode · c0ec3c27

Andrew Lunn authored Nov 10, 2018

Convert phy drivers to report the link partner advertised modes using
a linkmode bitmap. This allows them to report the higher speeds which
don't fit in a u32.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

c0ec3c27

net: ethernet: Convert phydev advertize and supported from u32 to link mode · 3c1bcc86

Andrew Lunn authored Nov 10, 2018

There are a few MAC/PHYs combinations which now support > 1Gbps. These
may need to make use of link modes with bits > 31. Thus their
supported PHY features or advertised features cannot be implemented
using the current bitmap in a u32. Convert to using a linkmode bitmap,
which can support all the currently devices link modes, and is future
proof as more modes are added.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

3c1bcc86

net: phy: remove states PHY_STARTING and PHY_PENDING · 899a3cbb

Heiner Kallweit authored Nov 10, 2018

Both states aren't used. Most likely they result from an idea that
never materialized. So remove them.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

899a3cbb

documentation of some IP/ICMP snmp counters · b08794a9

yupeng authored Nov 10, 2018

The snmp_counter.rst explains the meanings of snmp counters. It also
provides a set of experiments (only 1 for this initial patch),
combines the experiments' resutls and the snmp counters'
meanings. This is an initial path, only explains a part of IP/ICMP
counters and provide a simple ping test.
Signed-off-by: yupeng <yupeng0921@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b08794a9

tipc: improve broadcast retransmission algorithm · 31c4f4cc

LUU Duc Canh authored Nov 10, 2018

Currently, the broadcast retransmission algorithm is using the
'prev_retr' field in struct tipc_link to time stamp the latest broadcast
retransmission occasion. This helps to restrict retransmission of
individual broadcast packets to max once per 10 milliseconds, even
though all other criteria for retransmission are met.

We now move this time stamp to the control block of each individual
packet, and remove other limiting criteria. This simplifies the
retransmission algorithm, and eliminates any risk of logical errors
in selecting which packets can be retransmitted.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: LUU Duc Canh <canh.d.luu@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

31c4f4cc

Merge branch 'net-sched-indirect-tc-block-cb-registration' · bb5e6a82

David S. Miller authored Nov 11, 2018

Jakub Kicinski says:

====================
net: sched: indirect tc block cb registration

John says:

This patchset introduces an alternative to egdev offload by allowing a
driver to register for block updates when an external device (e.g. tunnel
netdev) is bound to a TC block. Drivers can track new netdevs or register
to existing ones to receive information on such events. Based on this,
they may register for block offload rules using already existing
functions.

The patchset also implements this new indirect block registration in the
NFP driver to allow the offloading of tunnel rules. The use of egdev
offload (which is currently only used for tunnel offload) is subsequently
removed.

RFC v2 -> PATCH
 - removed embedded tracking function from indir block register (now up to
   driver to clean up after itself)
 - refactored NFP code due to recent submissions
 - removed priv list clean function in NFP (list should be cleared by
   indirect block unregisters)

RFC v1->v2:
 - free allocated owner struct in block_owner_clean function
 - add geneve type helper function
 - move test stub in NFP (v1 patch 2) to full tunnel offload
   implementation via indirect blocks (v2 patches 3-8)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bb5e6a82

nfp: flower: remove unnecessary code in flow lookup · d4b69bad

John Hurley authored Nov 09, 2018

Recent changes to NFP mean that stats updates from fw to driver no longer
require a flow lookup and (because egdev offload has been removed) the
ingress netdev for a lookup is now always known.

Remove obsolete code in a flow lookup that matches on host context and
that allows for a netdev to be NULL.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d4b69bad

nfp: flower: remove TC egdev offloads · 4f63fde3

John Hurley authored Nov 09, 2018

Previously, only tunnel decap rules required egdev registration for
offload in NFP. These are now supported via indirect TC block callbacks.

Remove the egdev code from NFP.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4f63fde3

nfp: flower: offload tunnel decap rules via indirect TC blocks · 3166dd07

John Hurley authored Nov 09, 2018

Previously, TC block tunnel decap rules were only offloaded when a
callback was triggered through registration of the rules egress device.
This meant that the driver had no access to the ingress netdev and so
could not verify it was the same tunnel type that the rule implied.

Register tunnel devices for indirect TC block offloads in NFP, giving
access to new rules based on the ingress device rather than egress. Use
this to verify the netdev type of VXLAN and Geneve based rules and offload
the rules to HW if applicable.

Tunnel registration is done via a netdev notifier. On notifier
registration, this is triggered for already existing netdevs. This means
that NFP can register for offloads from devices that exist before it is
loaded (filter rules will be replayed from the TC core). Similarly, on
notifier unregister, a call is triggered for each currently active netdev.
This allows the driver to unregister any indirect block callbacks that may
still be active.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3166dd07

nfp: flower: increase scope of netdev checking functions · 65b7970e

John Hurley authored Nov 09, 2018

Both the actions and tunnel_conf files contain local functions that check
the type of an input netdev. In preparation for re-use with tunnel offload
via indirect blocks, move these to static inline functions in a header
file.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

65b7970e

nfp: flower: allow non repr netdev offload · 7885b4fc

John Hurley authored Nov 09, 2018

Previously the offload functions in NFP assumed that the ingress (or
egress) netdev passed to them was an nfp repr.

Modify the driver to permit the passing of non repr netdevs as the ingress
device for an offload rule candidate. This may include devices such as
tunnels. The driver should then base its offload decision on a combination
of ingress device and egress port for a rule.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7885b4fc

net: sched: register callbacks for indirect tc block binds · 7f76fa36

John Hurley authored Nov 09, 2018

Currently drivers can register to receive TC block bind/unbind callbacks
by implementing the setup_tc ndo in any of their given netdevs. However,
drivers may also be interested in binds to higher level devices (e.g.
tunnel drivers) to potentially offload filters applied to them.

Introduce indirect block devs which allows drivers to register callbacks
for block binds on other devices. The callback is triggered when the
device is bound to a block, allowing the driver to register for rules
applied to that block using already available functions.

Freeing an indirect block callback will trigger an unbind event (if
necessary) to direct the driver to remove any offloaded rules and unreg
any block rule callbacks. It is the responsibility of the implementing
driver to clean any registered indirect block callbacks before exiting,
if the block it still active at such a time.

Allow registering an indirect block dev callback for a device that is
already bound to a block. In this case (if it is an ingress block),
register and also trigger the callback meaning that any already installed
rules can be replayed to the calling driver.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7f76fa36

Merge branch 'PHYID-matching-macros' · d1ce0114

David S. Miller authored Nov 11, 2018

Heiner Kallweit says:

====================
net: phy: add macros for PHYID matching in PHY driver config

Add macros for PHYID matching to be used in PHY driver configs.
By using these macros some boilerplate code can be avoided.

Use them initially in the Realtek PHY drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d1ce0114

net: phy: realtek: use new PHYID matching macros · ca494936

Heiner Kallweit authored Nov 10, 2018

Use new macros for PHYID matching to avoid boilerplate code.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ca494936

net: phy: add macros for PHYID matching · aa2af2eb

Heiner Kallweit authored Nov 10, 2018

Add macros for PHYID matching to be used in PHY driver configs.
By using these macros some boilerplate code can be avoided.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aa2af2eb

Merge branch 'phylib-simplifications' · fa28a2b2

David S. Miller authored Nov 11, 2018

Heiner Kallweit says:

====================
net: phy: further phylib simplifications after recent changes to the state machine

After the recent changes to the state machine phylib can be further
simplified (w/o having to make any assumptions).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

fa28a2b2

net: phy: improve and inline phy_change · 34d884e3

Heiner Kallweit authored Nov 09, 2018

Now that phy_mac_interrupt() doesn't call phy_change() any longer it's
called from phy_interrupt() only. Therefore phy_interrupt_is_valid()
returns true always and the check can be removed.
In case of PHY_HALTED phy_interrupt() bails out immediately,
therefore the second check for PHY_HALTED including the call to
phy_disable_interrupts() can be removed.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

34d884e3

net: phy: simplify phy_mac_interrupt and related functions · d73a2156

Heiner Kallweit authored Nov 09, 2018

When using phy_mac_interrupt() the irq number is set to
PHY_IGNORE_INTERRUPT, therefore phy_interrupt_is_valid() returns false.
As a result phy_change() effectively just calls phy_trigger_machine()
when called from phy_mac_interrupt() via phy_change_work(). So we can
call phy_trigger_machine() from phy_mac_interrupt() directly and
remove some now unneeded code.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d73a2156

net: phy: don't set state PHY_CHANGELINK in phy_change · 8deeb630

Heiner Kallweit authored Nov 09, 2018

State PHY_CHANGELINK isn't needed here, we can call the state machine
directly. We just have to remove the check for phy_polling_mode() to
make this work also in interrupt mode. Removing this check doesn't
cause any overhead because when not polling the state machine is
called only if required by some event.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8deeb630

Merge branch 'remove-PHY_HAS_INTERRUPT' · d79e26a7

David S. Miller authored Nov 11, 2018

Heiner Kallweit says:

====================
net: phy: replace PHY_HAS_INTERRUPT with a check for config_intr and ack_interrupt

Flag PHY_HAS_INTERRUPT is used only here for this small check. I think
using interrupts isn't possible if a driver defines neither
config_intr nor ack_interrupts callback. So we can replace checking
flag PHY_HAS_INTERRUPT with checking for these callbacks.
This allows to remove this flag from all driver configs.

v2:
- add helper for check in patch 1
- remove PHY_HAS_INTERRUPT from all drivers, not only Realtek
- remove flag PHY_HAS_INTERRUPT completely

v3:
- rebase patch 2
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d79e26a7

net: phy: remove flag PHY_HAS_INTERRUPT from driver configs · a4307c0e

Heiner Kallweit authored Nov 09, 2018

Now that flag PHY_HAS_INTERRUPT has been replaced with a check for
callbacks config_intr and ack_interrupt, we can remove setting this
flag from all driver configs.
Last but not least remove flag PHY_HAS_INTERRUPT completely.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

a4307c0e

net: phy: replace PHY_HAS_INTERRUPT with a check for config_intr and ack_interrupt · 0d2e778e

Heiner Kallweit authored Nov 09, 2018

Flag PHY_HAS_INTERRUPT is used only here for this small check. I think
using interrupts isn't possible if a driver defines neither
config_intr nor ack_interrupts callback. So we can replace checking
flag PHY_HAS_INTERRUPT with checking for these callbacks.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

0d2e778e

sctp: Fix SKB list traversal in sctp_intl_store_ordered(). · e15e067d

David S. Miller authored Nov 10, 2018

Same change as made to sctp_intl_store_reasm().

To be fully correct, an iterator has an undefined value when something
like skb_queue_walk() naturally terminates.

This will actually matter when SKB queues are converted over to
list_head.

Formalize what this code ends up doing with the current
implementation.
Signed-off-by: David S. Miller <davem@davemloft.net>

e15e067d

sctp: Fix SKB list traversal in sctp_intl_store_reasm(). · 348bbc25

David S. Miller authored Nov 10, 2018

To be fully correct, an iterator has an undefined value when something
like skb_queue_walk() naturally terminates.

This will actually matter when SKB queues are converted over to
list_head.

Formalize what this code ends up doing with the current
implementation.
Signed-off-by: David S. Miller <davem@davemloft.net>

348bbc25

iucv: Remove SKB list assumptions. · 9e733177

David S. Miller authored Aug 22, 2018

Eliminate the assumption that SKBs and SKB list heads can
be cast to eachother in SKB list handling code.

This change also appears to fix a bug since the list->next pointer is
sampled outside of holding the SKB queue lock.
Signed-off-by: David S. Miller <davem@davemloft.net>

9e733177