Commits · f7aa74e483e81c7a064ebc29e5deeb6b31cde412 · Kirill Smelkov / linux

24 Sep, 2018 7 commits

neighbour: allow admin to set NTF_ROUTER · f7aa74e4

Roopa Prabhu authored Sep 22, 2018

This patch allows an admin to set the NTF_ROUTER flag
on a neighbour entry. This enables an external control
plane (like BGP EVPN) to manage neigh entries with the
NTF_ROUTER flag.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
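
As a rough illustration of what a control plane would put in the neighbour message header, the sketch below fills an ndmsg-shaped structure with the router flag set. The struct and constants are local copies of the layout and values in the UAPI header <linux/neighbour.h>, and the `sketch_` names and `make_router_neigh()` helper are hypothetical, not part of this patch.

```c
#include <stdint.h>
#include <string.h>

/* Local copies of values from <linux/neighbour.h>, redefined only so the
 * sketch compiles without kernel headers. */
#define SKETCH_NTF_ROUTER    0x80   /* neighbour is a router */
#define SKETCH_NUD_REACHABLE 0x02   /* neighbour state: reachable */

/* Same field layout as struct ndmsg in <linux/neighbour.h>. */
struct sketch_ndmsg {
    uint8_t  ndm_family;
    uint8_t  ndm_pad1;
    uint16_t ndm_pad2;
    int32_t  ndm_ifindex;
    uint16_t ndm_state;
    uint8_t  ndm_flags;
    uint8_t  ndm_type;
};

/* Hypothetical helper: build the header of a neighbour request that
 * marks the entry as a router. */
static struct sketch_ndmsg make_router_neigh(uint8_t family, int32_t ifindex)
{
    struct sketch_ndmsg ndm;

    memset(&ndm, 0, sizeof(ndm));
    ndm.ndm_family = family;
    ndm.ndm_ifindex = ifindex;
    ndm.ndm_state = SKETCH_NUD_REACHABLE;
    ndm.ndm_flags |= SKETCH_NTF_ROUTER;  /* the flag this patch lets admins set */
    return ndm;
}
```

A real request would carry this header plus address attributes in an RTM_NEWNEIGH netlink message; that part is left out here.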

Merge branch 'net-sched-Add-hardware-specific-counters-to-TC-actions' · ea49c6f0

David S. Miller authored Sep 24, 2018

Eelco Chaudron says:

====================
net/sched: Add hardware specific counters to TC actions

Add hardware specific counters to TC actions which will be exported
through the netlink API. This makes troubleshooting TC flower offload
easier, as it is possible to differentiate the packets being offloaded.

v2 - Rebased on latest net-next
====================
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: Add hardware specific counters to TC actions · 28169aba

Eelco Chaudron authored Sep 21, 2018

Add additional counters that will store the bytes/packets processed by
hardware. These will be exported through the netlink interface for
display by the iproute2 tc tool.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/core: Add new basic hardware counter · 5e111210

Eelco Chaudron authored Sep 21, 2018

Add a new hardware specific basic counter, TCA_STATS_BASIC_HW. This can
be used to count packets/bytes processed by hardware offload.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
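
A minimal sketch of the idea, assuming an action keeps one total counter and one hardware-only counter so that the software share can be derived by subtraction. The struct and function names below are hypothetical and do not come from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte/packet pair, shaped like a basic statistics counter. */
struct basic_counter {
    uint64_t bytes;
    uint64_t packets;
};

/* Hypothetical per-action statistics: a total and a hardware-only subset. */
struct action_stats {
    struct basic_counter total;
    struct basic_counter hw;
};

/* Account traffic; hardware-processed traffic is counted in both counters. */
static void action_stats_update(struct action_stats *s, uint64_t bytes,
                                uint64_t packets, bool from_hw)
{
    s->total.bytes += bytes;
    s->total.packets += packets;
    if (from_hw) {
        s->hw.bytes += bytes;
        s->hw.packets += packets;
    }
}

/* Software share is whatever the hardware did not process. */
static struct basic_counter action_stats_software(const struct action_stats *s)
{
    struct basic_counter sw = {
        .bytes = s->total.bytes - s->hw.bytes,
        .packets = s->total.packets - s->hw.packets,
    };
    return sw;
}
```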

Merge branch 'mvpp2-Add-txq-to-CPU-mapping' · 7ff2ea0b

David S. Miller authored Sep 24, 2018

Maxime Chevallier says:

====================
net: mvpp2: Add txq to CPU mapping

This short series adds XPS support to the mvpp2 driver, by mapping
txqs and CPUs. This comes with a patch using round-robin scheduling
for the HW to pick the next txq to transmit from, instead of the default
fixed-priority scheduling.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: use round-robin scheduling for TX queues on the same CPU · 4251ea5b

Maxime Chevallier authored Sep 24, 2018

This commit allows each TXQ to be picked in a round-robin fashion by
the PPv2 transmit scheduling mechanism. This is opposed to the default
behaviour that prioritizes the highest numbered queues.
Suggested-by: Yan Markman <ymarkman@marvell.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
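
The difference between the two policies can be sketched in software as follows. This is only an illustration of fixed-priority versus round-robin selection over per-queue backlogs; the PPv2 hardware scheduler is configured through registers, and the function names here are hypothetical.

```c
/* Fixed priority: always serve the highest-numbered non-empty queue.
 * Returns the queue index, or -1 if every queue is empty. */
static int pick_fixed_priority(const unsigned int *backlog, int nq)
{
    for (int q = nq - 1; q >= 0; q--)
        if (backlog[q])
            return q;
    return -1;
}

/* Round robin: start searching just after the queue served last, so every
 * non-empty queue gets a turn. Pass last = -1 before the first pick. */
static int pick_round_robin(const unsigned int *backlog, int nq, int last)
{
    for (int i = 1; i <= nq; i++) {
        int q = (last + i) % nq;

        if (backlog[q])
            return q;
    }
    return -1;
}
```

With backlogs on queues 0, 2 and 3, the fixed-priority pick keeps returning queue 3 until it drains, while the round-robin pick cycles through 0, 2 and 3.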

net: mvpp2: support XPS by mapping TX queues to CPUs · 0d283ab5

Maxime Chevallier authored Sep 24, 2018

Since the PPv2 controller has multiple TX queues, we can spread traffic
by assigning TX queues to CPUs, which allows XPS to balance egress
traffic between CPUs.

Suggested-by: Yan Markman <ymarkman@marvell.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
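
One way to picture a queue-to-CPU mapping is a per-queue CPU bitmask, as in the sketch below. The modulo assignment and all names are assumptions made for illustration; the commit message does not describe the exact mapping the driver uses.

```c
#include <stdint.h>

/* xps_map[q] is a bitmask of the CPUs allowed to transmit on queue q.
 * Hypothetical policy: queue q belongs to CPU (q % ncpus). */
static void map_txqs_to_cpus(uint32_t *xps_map, int ntxqs, int ncpus)
{
    for (int q = 0; q < ntxqs; q++)
        xps_map[q] = UINT32_C(1) << (q % ncpus);
}

/* Return the first queue whose mask contains the given CPU, or -1. */
static int txq_for_cpu(const uint32_t *xps_map, int ntxqs, int cpu)
{
    for (int q = 0; q < ntxqs; q++)
        if (xps_map[q] & (UINT32_C(1) << cpu))
            return q;
    return -1;
}
```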

23 Sep, 2018 5 commits

mlxsw: Make MLXSW_SP1_FWREV_MINOR a hard requirement · 12ba7e10

Petr Machata authored Sep 23, 2018

Up until now, mlxsw tolerated firmware versions that weren't exactly
matching the required version, if the branch number matched. That
allowed the users to test various firmware versions as long as they were
on the right branch.

On the other hand, it made it impossible for mlxsw to put a hard lower
bound on a version that fixes all problems known to date. If a user had
a somewhat older FW version installed, mlxsw would start up just fine,
possibly performing non-optimally as it would use features that trigger
problematic behavior.

Therefore tweak the check to accept any FW version that is:

- on the same branch as the preferred version, and
- the same as or newer than the preferred version.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
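
The two conditions above can be sketched as a small predicate. The struct, the function names, and in particular the assumption that the branch is the minor number divided by 100 are illustrative only and are not taken from the commit message.

```c
#include <stdbool.h>

struct fw_rev {
    unsigned int major;
    unsigned int minor;
    unsigned int subminor;
};

/* Assumption for this sketch: the branch is encoded in the minor number,
 * e.g. minors 1702 and 1703 would both be on branch 17. */
static unsigned int fw_branch(const struct fw_rev *rev)
{
    return rev->minor / 100;
}

/* Accept a version on the same branch as the preferred one that is the
 * same as or newer than it. */
static bool fw_rev_acceptable(const struct fw_rev *have,
                              const struct fw_rev *want)
{
    if (have->major != want->major)
        return false;
    if (fw_branch(have) != fw_branch(want))
        return false;
    if (have->minor != want->minor)
        return have->minor > want->minor;
    return have->subminor >= want->subminor;
}
```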

Merge branch 'hv_netvsc-Support-LRO-RSC-in-the-vSwitch' · 739d0def

David S. Miller authored Sep 22, 2018

Haiyang Zhang says:

====================
hv_netvsc: Support LRO/RSC in the vSwitch

The patch adds support for LRO/RSC in the vSwitch feature. It reduces
the per packet processing overhead by coalescing multiple TCP segments
when possible. The feature is enabled by default on VMs running on
Windows Server 2019 and later.

The patch set also adds an ethtool command handler and documentation.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

hv_netvsc: Update document for LRO/RSC support · f1951c22

Haiyang Zhang authored Sep 21, 2018

Update the documentation for LRO/RSC support, including the command line
to change the setting.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hv_netvsc: Add handler for LRO setting change · d6792a5a

Haiyang Zhang authored Sep 21, 2018

This patch adds the handler for LRO setting changes, so that a user
can use the ethtool command to enable or disable the LRO feature.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hv_netvsc: Add support for LRO/RSC in the vSwitch · c8e4eff4

Haiyang Zhang authored Sep 21, 2018

LRO/RSC in the vSwitch is a feature available in Windows Server 2019
hosts and later. It reduces the per packet processing overhead by
coalescing multiple TCP segments when possible. This patch adds netvsc
driver support for this feature.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

22 Sep, 2018 28 commits

Merge branch 'net-dsa-b53-SGMII-modes-fixes' · bd4d08da

David S. Miller authored Sep 21, 2018

Florian Fainelli says:

====================
net: dsa: b53: SGMII modes fixes

Here are two additional fixes that are required in order for SGMII to
work correctly. This was discovered while using a copper SFP, which should
make us use SGMII mode; instead we would leave the HW configured in its
default mode: Fiber.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: b53: Also include SGMII for mac_config and mac_link_state · 55a4d2ea

Florian Fainelli authored Sep 21, 2018

In both 802.3z and SGMII modes we need to configure the MAC accordingly
to flip between Fiber and SGMII modes, and we need to read the MAC
status from the SGMII in-band control word.

Fixes: 0e01491d ("net: dsa: b53: Add SerDes support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: b53: Fix B53_SERDES_DIGITAL_CONTROL offset · 2cae8c07

Florian Fainelli authored Sep 21, 2018

The maths went wrong: to get 0x20, we need to do 0x1e + (x) * 2, not 0x18.
Fix that offset so we access the correct registers. The wrong offset would
make us not access the correct SerDes Digital control words (status would
be fine), so we would not correctly flip between Fiber and SGMII modes,
resulting in incorrect status words being pulled into the SerDes digital
status register.

Fixes: 0e01491d ("net: dsa: b53: Add SerDes support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
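
The arithmetic can be checked with two macros. The macro names are hypothetical, and x = 1 is the value for which the corrected base yields the 0x20 mentioned above.

```c
/* Offset computed from the old base: 0x18 + 1 * 2 = 0x1a for x = 1. */
#define SKETCH_SERDES_DIGITAL_CONTROL_OLD(x)  (0x18 + (x) * 2)

/* Offset computed from the fixed base: 0x1e + 1 * 2 = 0x20 for x = 1. */
#define SKETCH_SERDES_DIGITAL_CONTROL_NEW(x)  (0x1e + (x) * 2)
```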

net: dsa: b53: Don't assign autonegotiation enabled · e24cf6b3

Florian Fainelli authored Sep 21, 2018

PHYLINK takes care of filling the right information into
state->an_enabled, so get rid of the read from the SerDes's BMCR register.

Fixes: 0e01491d ("net: dsa: b53: Add SerDes support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

decnet: Remove unnecessary check for dev->name · 5b9b0a80

Nathan Chancellor authored Sep 21, 2018

Clang warns that the address of an array will always evaluate to true
in a boolean context.

net/decnet/dn_dev.c:1366:10: warning: address of array 'dev->name' will
always evaluate to 'true' [-Wpointer-bool-conversion]
                                dev->name ? dev->name : "???",
                                ~~~~~^~~~ ~
1 warning generated.

Link: https://github.com/ClangBuiltLinux/linux/issues/116
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
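
The reason the check is unnecessary can be seen in a reduced example: when the name is an array embedded in the structure, `dev->name` can never be a null pointer. The struct and function below are hypothetical stand-ins, not the decnet code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a device structure with an embedded name. */
struct sketch_dev {
    char name[16];
};

/* Because name is an array and not a pointer, an expression such as
 * `dev->name ? dev->name : "???"` always selects dev->name, so the
 * name can be printed directly. */
static int format_dev_name(char *buf, size_t len, const struct sketch_dev *dev)
{
    return snprintf(buf, len, "%s", dev->name);
}
```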

selftests/net: add ipv6 tests to ip_defrag selftest · bccc1711

Peter Oskolkov authored Sep 21, 2018

This patch adds ipv6 defragmentation tests to ip_defrag selftest,
to complement existing ipv4 tests.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net · 83619623

Peter Oskolkov authored Sep 21, 2018

Currently, ip[6]frag_high_thresh sysctl values in new namespaces are
hard-limited to those of the root/init ns.

There are at least two use cases when it would be desirable to
set the high_thresh values higher in a child namespace vs the global hard
limit:

- a security/ddos protection policy may lower the thresholds in the
  root/init ns but allow for a special exception in a child namespace
- testing: a test running in a namespace may want to set these
  thresholds higher in its namespace than what is in the root/init ns

The new behavior:

 # ip netns add testns
 # ip netns exec testns bash

 # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl net.ipv4.ipfrag_high_thresh
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
 net.ipv6.ip6frag_high_thresh = 9000000

 # sysctl net.ipv6.ip6frag_high_thresh
 net.ipv6.ip6frag_high_thresh = 9000000

The old behavior:

 # ip netns add testns
 # ip netns exec testns bash

 # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl net.ipv4.ipfrag_high_thresh
 net.ipv4.ipfrag_high_thresh = 4194304

 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
 net.ipv6.ip6frag_high_thresh = 9000000

 # sysctl net.ipv6.ip6frag_high_thresh
 net.ipv6.ip6frag_high_thresh = 4194304
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
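
The two transcripts can be modelled with a pair of small functions. This is only a model of the observable behaviour: it assumes the old code left the value unchanged when a child namespace asked for more than the init_net value, which the transcript is consistent with but does not prove. All names are hypothetical.

```c
/* Old behaviour (as modelled here): a request above the init_net value
 * has no effect, and the namespace keeps its current threshold. */
static long high_thresh_old(long current, long requested, long init_net_value)
{
    if (requested > init_net_value)
        return current;
    return requested;
}

/* New behaviour: the child namespace may exceed the init_net value. */
static long high_thresh_new(long requested)
{
    return requested;
}
```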

ipv6: discard IP frag queue on more errors · 2475f59c

Peter Oskolkov authored Sep 21, 2018

This is similar to how ipv4 now behaves:
commit 0ff89efb ("ip: fail fast on IP defrag errors").
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ipv4: avoid compile error in fib_info_nh_uses_dev · 075e264f

Eric Dumazet authored Sep 21, 2018

net/ipv4/fib_frontend.c: In function 'fib_info_nh_uses_dev':
net/ipv4/fib_frontend.c:322:6: error: unused variable 'ret' [-Werror=unused-variable]
cc1: all warnings being treated as errors

Fixes: 78f2756c ("net/ipv4: Move device validation to helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp-switch-to-Early-Departure-Time-model' · a88e24f2

David S. Miller authored Sep 21, 2018

Eric Dumazet says:

====================
tcp: switch to Early Departure Time model

In the early days, pacing was implemented in sch_fq (FQ)
in a generic way:

- SO_MAX_PACING_RATE could be used by any sockets.

- TCP would vary effective pacing rate based on CWND*MSS/SRTT

- FQ would ensure delays between packets based on current
  sk->sk_pacing_rate, but with some quantum based artifacts.
  (inflating RPC tail latencies)

- BBR then tweaked the pacing rate in its various phases
  (PROBE, DRAIN, ...)

This worked reasonably well, but had the side effect that TCP RTT
samples would be inflated by the sojourn time of the packets in FQ.

Also note that when FQ is not used and TCP wants pacing, the
internal pacing fallback has very different behavior, since TCP
emits packets at the time they should be sent (with unreasonable
assumptions about scheduling costs)

Van Jacobson gave a talk at Netdev 0x12 in Montreal about letting
TCP (or applications for UDP messages) decide the Earliest
Departure Time, instead of letting packet schedulers derive it
from the pacing rate.

https://www.netdevconf.org/0x12/session.html?evolving-from-afap-teaching-nics-about-time
https://www.files.netdevconf.org/d/46def75c2ef345809bbe/files/?p=/Evolving%20from%20AFAP%20%E2%80%93%20Teaching%20NICs%20about%20time.pdf

Recent additions in Linux provided SO_TXTIME and a new ETF qdisc
supporting the new skb->tstamp role.

This patch series converts TCP and FQ to the same model.

This might in the future allow us to relax tight TSQ limits
(if FQ is present in the output path), and thus lower the
number of callbacks to tcp_write_xmit(), thanks to batching.

This will be followed by an FQ change allowing SO_TXTIME support
so that QUIC servers can let the pacing be done in FQ (or
offloaded if the network device permits).

For example, a TCP flow rated at 24Mbps now shows a more meaningful RTT

Before :

ESTAB  0  211408 10.246.7.151:41558   10.246.7.152:33723
	 cubic wscale:8,8 rto:203 rtt:2.195/0.084 mss:1448 rcvmss:536
  advmss:1448 cwnd:20 ssthresh:20 bytes_acked:36897937
  segs_out:25488 segs_in:12454 data_segs_out:25486
  send 105.5Mbps lastsnd:1 lastrcv:12851 lastack:1
  pacing_rate 24.0Mbps/24.0Mbps delivery_rate 22.9Mbps
  busy:12851ms unacked:4 rcv_space:29200 notsent:205616 minrtt:0.026

After :

ESTAB  0  192584 10.246.7.151:61612   10.246.7.152:34375
	 cubic wscale:8,8 rto:201 rtt:0.165/0.129 mss:1448 rcvmss:536
  advmss:1448 cwnd:20 ssthresh:20 bytes_acked:170755401
  segs_out:117931 segs_in:57651 data_segs_out:117929
  send 1404.1Mbps lastsnd:1 lastrcv:56915 lastack:1
  pacing_rate 24.0Mbps/24.0Mbps delivery_rate 24.2Mbps
  busy:56915ms unacked:4 rcv_space:29200 notsent:186792 minrtt:0.054

A nice side effect of this patch series is a reduction of max/p99
latencies of RPC workloads, since the FQ quantum no longer adds
artifacts.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
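
The `send` figures in the two dumps above are consistent with the CWND*MSS/RTT relation mentioned earlier: with cwnd 20 and mss 1448, an rtt of 2.195 ms gives about 105.5 Mbps and an rtt of 0.165 ms gives about 1404.1 Mbps. The helper below is a hypothetical sketch of that arithmetic, not code from the series.

```c
/* Rate implied by one congestion window per round trip, in Mbit/s.
 * cwnd is in segments, mss in bytes, rtt_ms in milliseconds. */
static double send_rate_mbps(unsigned int cwnd, unsigned int mss, double rtt_ms)
{
    double bits_per_rtt = (double)cwnd * (double)mss * 8.0;

    return bits_per_rtt / (rtt_ms / 1000.0) / 1e6;
}
```

This shows why the inflated RTT sample in the "Before" dump makes the flow look far slower than the path allows, even though the pacing rate is 24 Mbps in both cases.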

net_sched: sch_fq: remove dead code dealing with retransmits · 90caf67b

Eric Dumazet authored Sep 21, 2018

With the earliest departure time model, we no longer plan to
special-case TCP retransmits. We therefore remove dead
code (since most compilers understood skb_is_retransmit()
was false).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: switch tcp_internal_pacing() to tcp_wstamp_ns · c092dd5f

Eric Dumazet authored Sep 21, 2018

Now that TCP keeps track of tcp_wstamp_ns, recording the earliest
departure time of the next packet, we can remove duplicate code
from tcp_internal_pacing().

This removes one ktime_get_tai_ns() call, and a divide.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: switch tcp and sch_fq to new earliest departure time model · ab408b6d

Eric Dumazet authored Sep 21, 2018

TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
no longer has to do it.

Thanks to this model, TCP can get more accurate RTT samples,
since pacing no longer inflates them.

This has the nice effect of removing some delays caused by the FQ
quantum mechanism, which inflated max/P99 latencies.

Also we might relax TCP Small Queue tight limits in the future,
since this new model allows TCP to build bigger batches, because
sch_fq (or a device with earliest departure time offload) ensures
these packets will be delivered on time.

Note that other protocols are not converted (they will probably
never be), so sch_fq still has support for SO_MAX_PACING_RATE.

Tested:

Test showing FQ pacing quantum artifact for low-rate flows,
adding unexpected throttles for RPC flows, inflating max and P99 latencies.

The parameters chosen here are to show what happens typically when
a TCP flow has a reduced pacing rate (this can be caused by a reduced
cwnd after a few losses, and/or an rtt above a few ms).

MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
Before :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
19,82.78,5279,3825,482.02

After :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
20,49.94,128,63,3.18
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
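
A minimal sketch of the earliest departure time idea, assuming the sender keeps one "next departure" timestamp per flow and advances it by the serialization time of each packet at the pacing rate. The struct and function names are hypothetical, and the kernel's bookkeeping is more involved than this.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-flow state. */
struct edt_flow {
    uint64_t next_departure_ns;   /* earliest time the next packet may leave */
    uint64_t rate_bytes_per_sec;  /* pacing rate */
};

/* Stamp a packet of len bytes handed over at now_ns. Returns the packet's
 * earliest departure time and advances the flow's next departure time. */
static uint64_t edt_stamp(struct edt_flow *f, uint64_t now_ns, uint32_t len)
{
    uint64_t departure = f->next_departure_ns > now_ns ?
                         f->next_departure_ns : now_ns;

    f->next_departure_ns = departure +
        (uint64_t)len * SKETCH_NSEC_PER_SEC / f->rate_bytes_per_sec;
    return departure;
}
```

At 24 Mbps (3,000,000 bytes per second), two back-to-back 1500-byte packets end up stamped 500 microseconds apart, and a scheduler or NIC only has to hold each packet until its stamp.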

tcp: switch internal pacing timer to CLOCK_TAI · fd2bca2a

Eric Dumazet authored Sep 21, 2018

The next patch will use tcp_wstamp_ns to feed the internal
TCP pacing timer, so switch to CLOCK_TAI to share the same base.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: provide earliest departure time in skb->tstamp · d3edd06e

Eric Dumazet authored Sep 21, 2018

Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns,
from usec units to nsec units.

Do not clear skb->tstamp before entering IP stacks in TX,
so that qdisc or devices can implement pacing based on the
earliest departure time instead of socket sk->sk_pacing_rate

Packets are fed with tcp_wstamp_ns, and a following patch
will update tcp_wstamp_ns when both TCP and sch_fq switch to
the earliest departure time mechanism.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: add tcp_wstamp_ns socket field · 9799ccb0

Eric Dumazet authored Sep 21, 2018

TCP will soon provide earliest departure time on TX skbs.
It needs to track this in a new variable.

tcp_mstamp_refresh() needs to update this variable, and
became too big to stay an inline.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net_sched: sch_fq: switch to CLOCK_TAI · 142537e4

Eric Dumazet authored Sep 21, 2018

TCP will soon provide a per-skb skb->tstamp with the earliest departure time,
so that sch_fq does not have to determine departure time by looking
at socket sk_pacing_rate.

We chose in linux-4.19 CLOCK_TAI as the clock base for transports,
qdiscs, and NIC offloads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: introduce tcp_skb_timestamp_us() helper · 2fd66ffb

Eric Dumazet authored Sep 21, 2018

There are a few places where TCP reads skb->skb_mstamp expecting
a value in usec units.

skb->tstamp (aka skb->skb_mstamp) will soon store CLOCK_TAI nsec value.

Add tcp_skb_timestamp_us() to provide proper conversion when needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
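
The conversion such a helper has to perform once the stored value is in nanoseconds can be sketched as below. The struct and names are hypothetical userspace stand-ins; the kernel helper operates on struct sk_buff.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_USEC 1000ULL

/* Hypothetical stand-in for a buffer carrying a nanosecond timestamp. */
struct sketch_skb {
    uint64_t tstamp_ns;
};

/* Return the timestamp in microseconds for callers that expect usec units. */
static uint64_t sketch_skb_timestamp_us(const struct sketch_skb *skb)
{
    return skb->tstamp_ns / SKETCH_NSEC_PER_USEC;
}
```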

tcp: switch tcp_clock_ns() to CLOCK_TAI base · 72b0094f

Eric Dumazet authored Sep 21, 2018

TCP pacing is either implemented in sch_fq or internally.
We have the goal of being able to offload pacing to the NICs.

TCP will soon provide a per-skb skb->tstamp as the earliest departure time.

Like ETF in commit 25db26a9 ("net/sched: Introduce the ETF Qdisc")
we chose CLOCK_TAI as the clock base, so that TCP and pacers can share
a common clock, to get better RTT samples (without pacing artificially
inflating these samples).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
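
For comparison, userspace can read the same clock base with clock_gettime(). The sketch below is a userspace illustration only, not the kernel's tcp_clock_ns(); the helper name is hypothetical, and the fallback definition of CLOCK_TAI assumes the Linux clock id 11 for libc headers that do not expose it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdint.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11  /* Linux clock id; fallback for older libc headers */
#endif

/* Read CLOCK_TAI as a nanosecond count. Returns 0 if the clock cannot be
 * read on this system. */
static uint64_t clock_tai_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_TAI, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}
```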

Merge branch 'hns3-next' · 4f4b93a8

David S. Miller authored Sep 21, 2018

Salil Mehta says:

====================
Bug fixes, small modifications & cleanup for HNS3 driver

This patch presents some bug fixes, small modifications and cleanups
to the HNS3 VF and PF driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Remove redundant hclge_get_port_type() · ebfefb8a

Peng Li authored Sep 21, 2018

This patch removes hclge_get_port_type which is redundant.
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Fix speed/duplex information loss problem when executing ethtool ethx cmd of VF · 5f373b15

Fuyun Liang authored Sep 21, 2018

Our VF has not implemented the ops for get_port_type. So when we execute
the ethtool ethX command on a VF, hns3_get_link_ksettings returns immediately
and we cannot query anything.

To support get_link_ksettings for VF, this patch replaces get_port_type
with get_media_type. If the media type is HNAE3_MEDIA_TYPE_NONE,
hns3_get_link_ksettings will return link information of VF.

Fixes: 12f46bc1 ("net: hns3: Refine hns3_get_link_ksettings()")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Add get_media_type ops support for VF · c136b884

Peng Li authored Sep 21, 2018

This patch adds get_media_type ops support for the VF.
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
