Commits · 6cda8b7493ac323c3b58a9a897abc0e6432d5a1d · Kirill Smelkov / linux

18 Jan, 2019 14 commits

tcp: move app_limited init to tcp_disconnect() · 6cda8b74

Eric Dumazet authored Jan 17, 2019

If we make sure all listeners have app_limited set to ~0U,
then a clone will also inherit proper initial value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6cda8b74

tcp: move retrans_out, sacked_out, tlp_high_seq, last_oow_ack_time init to tcp_disconnect() · 5c701549

Eric Dumazet authored Jan 17, 2019

If we make sure all listeners have these fields cleared, then a clone
will also inherit zero values.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5c701549

tcp: do not clear urg_data in tcp_create_openreq_child · 5d836764

Eric Dumazet authored Jan 17, 2019

All listeners have this field cleared already, since tcp_disconnect()
clears it and newly created sockets have also a zero value here.

So a clone will inherit a zero value here.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5d836764

tcp: move snd_cwnd & snd_cwnd_cnt init to tcp_disconnect() · 3a9a57f6

Eric Dumazet authored Jan 17, 2019

Passive connections can inherit proper value by cloning,
if we make sure all listeners have the proper values there.

tcp_disconnect() was setting snd_cwnd to 2, which seems
quite obsolete since IW10 adoption.

Also remove an obsolete comment.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3a9a57f6

tcp: move mdev_us init to tcp_disconnect() · b9e2e689

Eric Dumazet authored Jan 17, 2019

If we make sure a listener always has its mdev_us
field set to TCP_TIMEOUT_INIT, we do not need to rewrite
this field after a new clone is created.

tcp_disconnect() is very seldom used in real applications.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b9e2e689

tcp: do not clear srtt_us in tcp_create_openreq_child · a0070e46

Eric Dumazet authored Jan 17, 2019

All listeners have this field cleared already, since tcp_disconnect()
clears it and newly created sockets have also a zero value here.

So a clone will inherit a zero value here.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a0070e46

tcp: do not clear packets_out in tcp_create_openreq_child() · eb2c80ca

Eric Dumazet authored Jan 17, 2019

New sockets have this field cleared, and tcp_disconnect()
calls tcp_write_queue_purge() which among other things
also clear tp->packets_out

So a listener is guaranteed to have this field cleared.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eb2c80ca

tcp: move icsk_rto init to tcp_disconnect() · 6a408147

Eric Dumazet authored Jan 17, 2019

If we make sure a listener always has its icsk_rto
field set to TCP_TIMEOUT_INIT, we do not need to rewrite
this field after a new clone is created.

tcp_disconnect() is very seldom used in real applications.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6a408147

tcp: do not set snd_ssthresh in tcp_create_openreq_child() · b84235e2

Eric Dumazet authored Jan 17, 2019

New sockets get the field set to TCP_INFINITE_SSTHRESH in tcp_init_sock()
In case a socket had this field changed and transitions to TCP_LISTEN
state, tcp_disconnect() also makes sure snd_ssthresh is set to
TCP_INFINITE_SSTHRESH.

So a listener has this field set to TCP_INFINITE_SSTHRESH already.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b84235e2

net/mlx4: remove unneeded semicolon · bec03deb

YueHaibing authored Jan 17, 2019

Remove unneeded semicolon.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bec03deb

net: ethernet: ti: cpsw-phy-sel: remove unneeded semicolon · 5c423d71

YueHaibing authored Jan 17, 2019

Remove unneeded semicolon.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5c423d71

tipc: remove unneeded semicolon in trace.c · d4fb30f6

YueHaibing authored Jan 17, 2019

Remove unneeded semicolon
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d4fb30f6

qed: remove duplicated include from qed_if.h · 8b59bfe8

YueHaibing authored Jan 17, 2019

Remove duplicated include.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Denis Bolotin <dbolotin@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8b59bfe8

sb1000: fix a couple of indentation issues and remove assignment in if statements · 6394d98d

Colin Ian King authored Jan 17, 2019

There is an if statement and a return statement that are incorrectly
indented. Fix these.  Also replace the assignment-in-if statements
to assignment followed by an if to keep to the coding style.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6394d98d

17 Jan, 2019 26 commits

net: add a route cache full diagnostic message · 22c2ad61

Peter Oskolkov authored Jan 16, 2019

In some testing scenarios, dst/route cache can fill up so quickly
that even an explicit GC call occasionally fails to clean it up. This leads
to sporadically failing calls to dst_alloc and "network unreachable" errors
to the user, which is confusing.

This patch adds a diagnostic message to make the cause of the failure
easier to determine.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

22c2ad61

dpaa2-eth: Fix ndo_stop routine · 68d74315

Ioana Ciocoi Radulescu authored Jan 16, 2019

In the current implementation, on interface down we disabled NAPI and
then manually drained any remaining ingress frames. This could lead
to a situation when, under heavy traffic, the data availability
notification for some of the channels would not get rearmed correctly.

Change the implementation such that we let all remaining ingress frames
be processed as usual and only disable NAPI once the hardware queues
are empty.

We also add a wait on the Tx side, to allow hardware time to process
all in-flight Tx frames before issueing the disable command.
Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

68d74315

wan: dscc4: fix various indentation issues · 5191673b

Colin Ian King authored Jan 16, 2019

There are some lines that have indentation issues, fix these.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5191673b

Merge branch 'vxlan-FDB-veto' · 039d52e1

David S. Miller authored Jan 17, 2019

Petr Machata says:

====================
vxlan: Allow vetoing FDB operations

mlxsw does not implement handling of the more advanced types of VXLAN
FDB entries. In order to provide visibility to users, it is important to
be able to reject such FDB entries, ideally with an explanation passed
in extended ack. This patch set implements this.

In patches #1-#4, vxlan is gradually transformed to support vetoing of
FDB entries added (or modified) through vxlan_fdb_update(), and the
default FDB entry added in __vxlan_dev_create().

Patches #5-#7 deal with vxlan_changelink(). The existing code recognizes
that vxlan_fdb_update() may fail, but doesn't attempt to keep things
intact if it does. These patches change the function in several steps to
gracefully handle vetoes (or other failures).

Then in patches #8-#11, extack arguments are added, respectively, to
ndo_fdb_add(), mlxsw's mlxsw_sp_nve_ops.fdb_replay, the functions that
connect to the VXLAN vetoing code, and call_switchdev_notifiers(). Note
that call_switchdev_blocking_notifiers() already does support extack.

Finally in patch #12, mlxsw is extended to add extack messages to
rejected FDB entries. In patch #13, the functionality is tested.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

039d52e1

selftests: mlxsw: Test veto of unsupported VXLAN FDBs · 7e1046fd

Petr Machata authored Jan 16, 2019

mlxsw doesn't implement offloading of all types of FDB entries that the
VXLAN driver supports. Test that such FDB entries are rejected. That
makes sure that the decision made by the existing validation code in
mlxsw propagates up the stack. It also exercises rollback functionality
in VXLAN, and tests that extack is returned.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7e1046fd

mlxsw: spectrum: Add extack messages to VXLAN FDB rejection · a40313d9

Petr Machata authored Jan 16, 2019

Annotate the rejections in mlxsw_sp_switchdev_vxlan_work_prepare() with
textual reasons.

Because this code ends up being invoked for FDB replay as well, drop the
default message from there, so that the more accurate error message is
not overwritten.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a40313d9

switchdev: Add extack argument to call_switchdev_notifiers() · 6685987c

Petr Machata authored Jan 16, 2019

A follow-up patch will enable vetoing of FDB entries. Make it possible
to communicate details of why an FDB entry is not acceptable back to the
user.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6685987c

vxlan: Add extack to switchdev operations · 4c59b7d1

Petr Machata authored Jan 16, 2019

There are four sources of VXLAN switchdev notifier calls:

- the changelink() link operation, which already supports extack,
- ndo_fdb_add() which got extack support in a previous patch,
- FDB updates due to packet forwarding,
- and vxlan_fdb_replay().

Extend vxlan_fdb_switchdev_call_notifiers() to include extack in the
switchdev message that it sends, and propagate the argument upwards to
the callers. For the first two cases, pass in the extack gotten through
the operation. For case #3, pass in NULL.

To cover the last case, extend vxlan_fdb_replay() to take extack
argument, which might come from whatever operation necessitated the FDB
replay.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4c59b7d1

mlxsw: Add extack to mlxsw_sp_nve_ops.fdb_replay · d907f58f

Petr Machata authored Jan 16, 2019

A follow-up patch will extend vxlan_fdb_replay() with an extack
argument. Extend the fdb_replay callback in mlxsw likewise so that the
argument is ready for the vxlan conversion.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d907f58f

net: Add extack argument to ndo_fdb_add() · 87b0984e

Petr Machata authored Jan 16, 2019

Drivers may not be able to support certain FDB entries, and an error
code is insufficient to give clear hints as to the reasons of rejection.

In order to make it possible to communicate the rejection reason, extend
ndo_fdb_add() with an extack argument. Adapt the existing
implementations of ndo_fdb_add() to take the parameter (and ignore it).
Pass the extack parameter when invoking ndo_fdb_add() from rtnl_fdb_add().
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

87b0984e

vxlan: changelink: Delete remote after update · 1cdc98c2

Petr Machata authored Jan 16, 2019

If a change in remote address prompts a change in a default FDB entry,
that change might be vetoed. If that happens, it would then be necessary
to reinstate the already-removed default FDB entry corresponding to the
previous remote address.

Instead, arrange to have the previous address removed only after the
FDB is successfully vetted.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1cdc98c2

vxlan: changelink: Postpone vxlan_config_apply() · 038a5a99

Petr Machata authored Jan 16, 2019

When an FDB entry is vetoed, it is necessary to unroll the changes that
have already been done. To avoid having to unroll vxlan_config_apply(),
postpone the call after the point where the vetoing takes place. Since
the call can't fail, it doesn't necessitate any cleanups in the
preceding FDB update logic.

Correspondingly, move down the mod_timer() call as well.

References to *dst need to be replaced with references to conf.
Additionally, old_dst and old_age_interval are not necessary anymore,
and therefore drop them.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

038a5a99

vxlan: changelink: Inline vxlan_dev_configure() · 8db9427d

Petr Machata authored Jan 16, 2019

The changelink operation may cause change in remote address, and
therefore an FDB update, which can be vetoed. To properly handle
vetoing, vxlan_changelink() needs to be gradually updated.

In this patch simply replace vxlan_dev_configure() with the two
constituent calls.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8db9427d

vxlan: Allow vetoing of FDB notifications · 61f46fe8

Petr Machata authored Jan 16, 2019

Change vxlan_fdb_switchdev_call_notifiers() to return the result from
calling switchdev notifiers. Propagate the error number up the stack.

In vxlan_fdb_update_existing() and vxlan_fdb_update_create() add
rollbacks to clean up the work that was done before the veto.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

61f46fe8

vxlan: Have vxlan_fdb_replace() save original rdst value · ccdfd4f7

Petr Machata authored Jan 16, 2019

To enable rollbacks after vetoed FDB updates, extend vxlan_fdb_replace()
to take an additional argument where it should store the original values
of a modified rdst. Update the sole caller.

The following patch will make use of the saved value.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ccdfd4f7

vxlan: Split vxlan_fdb_update() in two · a76d1ca2

Petr Machata authored Jan 16, 2019

In order to make it easier to implement rollbacks after FDB update
vetoing, separate the FDB update code to two parts: one that deals with
updates of existing FDB entries, and one that creates new entries.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a76d1ca2

vxlan: Move up vxlan_fdb_free(), vxlan_fdb_destroy() · c2b200e0

Petr Machata authored Jan 16, 2019

These functions will be needed for rollbacks of vetoed FDB entries. Move
them up so that they are visible at their intended point of use.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c2b200e0

Merge branch 'improving-TCP-behavior-on-host-congestion' · 12ff91c8

David S. Miller authored Jan 17, 2019

Yuchung Cheng says:

====================
improving TCP behavior on host congestion

This patch set aims to improve how TCP handle local qdisc congestion
by simplifying the previous implementation.  Previously when an
skb fails to (re)transmit due to local qdisc congestion or other
resource issue, TCP refrains from setting the skb timestamp or the
recovery starting time.

This design makes determining when to abort a stalling socket more
complicated, as the timestamps of these tranmission attempts were
missing. The stack needs to sort of infer when the original attempt
happens. A by-product is a socket may disregard the system timeout
limit (i.e. sysctl net.ipv4.tcp_retries2 or USER_TIMEOUT option),
and continue to retry until the transmission is successful.

In data-center environment when TCP RTO is small, this could cause
the socket to retry frequently for long during qdisc congestion.

The solution is to first unconditionally timestamp skb and recovery
attempt. Then retry more conservatively (twice a second) on local
qdisc congestion but abort the sockets according to the system limit.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

12ff91c8

tcp: less aggressive window probing on local congestion · c1d5674f

Yuchung Cheng authored Jan 16, 2019

Previously when the sender fails to send (original) data packet or
window probes due to congestion in the local host (e.g. throttling
in qdisc), it'll retry within an RTO or two up to 500ms.

In low-RTT networks such as data-centers, RTO is often far below
the default minimum 200ms. Then local host congestion could trigger
a retry storm pouring gas to the fire. Worse yet, the probe counter
(icsk_probes_out) is not properly updated so the aggressive retry
may exceed the system limit (15 rounds) until the packet finally
slips through.

On such rare events, it's wise to retry more conservatively
(500ms) and update the stats properly to reflect these incidents
and follow the system limit. Note that this is consistent with
the behaviors when a keep-alive probe or RTO retry is dropped
due to local congestion.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c1d5674f

tcp: retry more conservatively on local congestion · 590d2026

Yuchung Cheng authored Jan 16, 2019

Previously when the sender fails to retransmit a data packet on
timeout due to congestion in the local host (e.g. throttling in
qdisc), it'll retry within an RTO up to 500ms.

In low-RTT networks such as data-centers, RTO is often far
below the default minimum 200ms (and the cap 500ms). Then local
host congestion could trigger a retry storm pouring gas to the
fire. Worse yet, the retry counter (icsk_retransmits) is not
properly updated so the aggressive retry may exceed the system
limit (15 rounds) until the packet finally slips through.

On such rare events, it's wise to retry more conservatively (500ms)
and update the stats properly to reflect these incidents and follow
the system limit. Note that this is consistent with the behavior
when a keep-alive probe is dropped due to local congestion.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

590d2026

tcp: simplify window probe aborting on USER_TIMEOUT · 9721e709

Yuchung Cheng authored Jan 16, 2019

Previously we use the next unsent skb's timestamp to determine
when to abort a socket stalling on window probes. This no longer
works as skb timestamp reflects the last instead of the first
transmission.

Instead we can estimate how long the socket has been stalling
with the probe count and the exponential backoff behavior.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9721e709

tcp: create a helper to model exponential backoff · 01a523b0

Yuchung Cheng authored Jan 16, 2019

Create a helper to model TCP exponential backoff for the next patch.
This is pure refactor w no behavior change.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

01a523b0

tcp: properly track retry time on passive Fast Open · c7d13c8f

Yuchung Cheng authored Jan 16, 2019

This patch addresses a corner issue on timeout behavior of a
passive Fast Open socket.  A passive Fast Open server may write
and close the socket when it is re-trying SYN-ACK to complete
the handshake. After the handshake is completely, the server does
not properly stamp the recovery start time (tp->retrans_stamp is
0), and the socket may abort immediately on the very first FIN
timeout, instead of retying until it passes the system or user
specified limit.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c7d13c8f

tcp: always set retrans_stamp on recovery · 7ae18975

Yuchung Cheng authored Jan 16, 2019

Previously TCP socket's retrans_stamp is not set if the
retransmission has failed to send. As a result if a socket is
experiencing local issues to retransmit packets, determining when
to abort a socket is complicated w/o knowning the starting time of
the recovery since retrans_stamp may remain zero.

This complication causes sub-optimal behavior that TCP may use the
latest, instead of the first, retransmission time to compute the
elapsed time of a stalling connection due to local issues. Then TCP
may disrecard TCP retries settings and keep retrying until it finally
succeed: not a good idea when the local host is already strained.

The simple fix is to always timestamp the start of a recovery.
It's worth noting that retrans_stamp is also used to compare echo
timestamp values to detect spurious recovery. This patch does
not break that because retrans_stamp is still later than when the
original packet was sent.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7ae18975

tcp: always timestamp on every skb transmission · 7f12422c

Yuchung Cheng authored Jan 16, 2019

Previously TCP skbs are not always timestamped if the transmission
failed due to memory or other local issues. This makes deciding
when to abort a socket tricky and complicated because the first
unacknowledged skb's timestamp may be 0 on TCP timeout.

The straight-forward fix is to always timestamp skb on every
transmission attempt. Also every skb retransmission needs to be
flagged properly to avoid RTT under-estimation. This can happen
upon receiving an ACK for the original packet and the a previous
(spurious) retransmission has failed.

It's worth noting that this reverts to the old time-stamping
style before commit 8c72c65b ("tcp: update skb->skb_mstamp more
carefully") which addresses a problem in computing the elapsed time
of a stalled window-probing socket. The problem will be addressed
differently in the next patches with a simpler approach.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7f12422c

tcp: exit if nothing to retransmit on RTO timeout · 88f8598d

Yuchung Cheng authored Jan 16, 2019

Previously TCP only warns if its RTO timer fires and the
retransmission queue is empty, but it'll cause null pointer
reference later on. It's better to avoid such catastrophic failure
and simply exit with a warning.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

88f8598d