1. 16 Apr, 2016 2 commits
    • Hannes Frederic Sowa's avatar
      vxlan: synchronously and race-free destruction of vxlan sockets · 0412bd93
      Hannes Frederic Sowa authored
      Because the udp socket is destructed asynchronously in a work queue,
      we get nondeterministic behavior when shutting down vxlan tunnels and
      creating new ones. Fix this by keeping the destruction process
      synchronous with regard to the user space process, so that IFF_UP can
      be set reliably.
      
      udp_tunnel_sock_release destroys vs->sock->sk if the reference counter
      indicates so. We expect vxlan_sock and vxlan_sock->sock->sk to have the
      same lifetime even in fast paths with only rcu locks held, so only
      destruct the whole socket once we can be sure it can no longer be found
      by searching vxlan_net->sock_list.
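
      A minimal sketch of the intended teardown ordering, assuming the vxlan
      driver's private vxlan_net (sock_lock/sock_list) and vxlan_sock (hlist,
      sock) types; the exact flow in the patch may differ:

        #include <net/vxlan.h>
        #include <net/udp_tunnel.h>
        #include <linux/slab.h>

        /* unhook first, wait for RCU readers, only then free the socket */
        static void vxlan_sock_release_sketch(struct vxlan_net *vn,
                                              struct vxlan_sock *vs)
        {
                spin_lock(&vn->sock_lock);
                hlist_del_rcu(&vs->hlist);          /* unfindable via sock_list now */
                spin_unlock(&vn->sock_lock);

                synchronize_net();                  /* fast-path RCU readers are done */
                udp_tunnel_sock_release(vs->sock);  /* safe: no lookup can race */
                kfree(vs);
        }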
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jiri Benc <jbenc@redhat.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0412bd93
    • wangweidong's avatar
      phy: make some bits preserved while setup forced mode · f48256ef
      wangweidong authored
      When testing PHY SGMII loopback:
      1. set the LOOPBACK bit,
      2. set autoneg to AUTONEG_DISABLE; this calls genphy_setup_forced,
         which clears the bit.

      The BMCR_LOOPBACK bit should be preserved.

      As Florian pointed out, other bits should be preserved too,
      so preserve BMCR_ISOLATE and BMCR_PDOWN as well.
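
      A minimal sketch of the idea, using the standard phylib/MII accessors
      (the actual genphy_setup_forced() change may differ in detail):

        #include <linux/mii.h>
        #include <linux/phy.h>

        /* keep LOOPBACK/ISOLATE/PDOWN, rewrite only the forced-mode bits */
        static int setup_forced_sketch(struct phy_device *phydev)
        {
                int ctl = phy_read(phydev, MII_BMCR);

                if (ctl < 0)
                        return ctl;

                ctl &= BMCR_LOOPBACK | BMCR_ISOLATE | BMCR_PDOWN;
                if (phydev->speed == SPEED_1000)
                        ctl |= BMCR_SPEED1000;
                else if (phydev->speed == SPEED_100)
                        ctl |= BMCR_SPEED100;
                if (phydev->duplex == DUPLEX_FULL)
                        ctl |= BMCR_FULLDPLX;

                return phy_write(phydev, MII_BMCR, ctl);
        }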
      Signed-off-by: Weidong Wang <wangweidong1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f48256ef
  2. 15 Apr, 2016 38 commits
    • David S. Miller's avatar
      Merge branch 'sctp-diag' · 4e811b1e
      David S. Miller authored
      Xin Long says:
      
      ====================
      sctp: support sctp_diag in kernel
      
      This patchset adds the sctp_diag module to implement the diag interface
      for sctp in the kernel.

      For a listening sctp endpoint, we just dump its ep info.
      For an sctp connection, we dump the assoc info and its ep info.

      The ss dump will look like:
      
      [iproute2]# ./misc/ss --sctp  -n -l
      State      Recv-Q Send-Q   Local Address:Port       Peer Address:Port
      LISTEN     0      128      172.16.254.254:8888      *:*
      LISTEN     0      5        127.0.0.1:1234           *:*
      LISTEN     0      5        127.0.0.1:1234           *:*
        - ESTAB  0      0        127.0.0.1%lo:1234        127.0.0.1:4321
      LISTEN     0      128      172.16.254.254:8888      *:*
        - ESTAB  0      0        172.16.254.254%eth1:8888 172.16.253.253:8888
        - ESTAB  0      0        172.16.254.254%eth1:8888 172.16.1.1:8888
        - ESTAB  0      0        172.16.254.254%eth1:8888 172.16.1.2:8888
        - ESTAB  0      0        172.16.254.254%eth1:8888 172.16.2.1:8888
        - ESTAB  0      0        172.16.254.254%eth1:8888 172.16.2.2:8888
        - ESTAB  0      0        172.16.254.254%eth1:8888 172.16.3.1:8888
        - ESTAB  0      0        172.16.254.254%eth1:8888 172.16.3.2:8888
      LISTEN     0      0        127.0.0.1:4321           *:*
        - ESTAB  0      0        127.0.0.1%lo:4321        127.0.0.1:1234
      
      The entries with '- ESTAB' are the assocs; some of them may belong to
      the same endpoint. So we dump the parent endpoint first, as the entry
      with 'LISTEN', and then dump its assocs. The ep and assoc entries are
      dumped in the right order so that ss can show them in tree format easily.

      Besides, this patchset also simplifies the sctp proc code, since it
      shares similar transport traversal code with sctp diag.
      
      v1->v2:
        1. inet_diag_get_handler needs to return a const pointer.
        2. merge 5/7 into 2/7 of v1.

      v2->v3:
        do some improvements and fixes in patches 1-4; see the details in
        each patch's comment.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4e811b1e
    • Xin Long's avatar
      sctp: fix some rhashtable functions using in sctp proc/diag · 53fa1036
      Xin Long authored
      When rhashtable_walk_init returns an error, no release function should
      be called, and when rhashtable_walk_start returns an error, we should
      only invoke rhashtable_walk_exit to release the resources.

      But currently, when sctp_transport_walk_start returns an error, we just
      call rhashtable_walk_stop/exit and never check whether
      rhashtable_walk_init or _start returned an error, which is wrong.

      Fix it by calling rhashtable_walk_exit if rhashtable_walk_start returns
      an error in sctp_transport_walk_start, and if sctp_transport_walk_start
      returns an error, do not call sctp_transport_walk_stop any more.

      For sctp proc, we use 'iter->start_fail' to decide whether to call
      rhashtable_walk_stop/exit.
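
      A minimal sketch of the init/start/exit pairing described above (the
      error handling in the real sctp_transport_walk_start() may differ
      slightly):

        #include <linux/rhashtable.h>

        static int walk_start_sketch(struct rhashtable *ht,
                                     struct rhashtable_iter *iter)
        {
                int err;

                err = rhashtable_walk_init(ht, iter, GFP_KERNEL);
                if (err)
                        return err;             /* init failed: nothing to release */

                err = rhashtable_walk_start(iter);
                if (err && err != -EAGAIN) {
                        rhashtable_walk_exit(iter);     /* exit only, never stop here */
                        return err;
                }
                return 0;       /* on success, callers later do stop + exit */
        }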
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      53fa1036
    • Xin Long's avatar
      sctp: merge the seq_start/next/exits in remaddrs and assocs · b5e2f4e6
      Xin Long authored
      In sctp proc, these three functions are the same for remaddrs and
      assocs, so merge them into one.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b5e2f4e6
    • Xin Long's avatar
      sctp: add the sctp_diag.c file · 8f840e47
      Xin Long authored
      This one implements the whole inet_diag interface, inet_diag_handler,
      which includes sctp_diag_dump, sctp_diag_dump_one and sctp_diag_get_info.

      It works as a module and registers the inet_diag_handler when loaded.
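
      A minimal sketch of what the registration side of such a module might
      look like; the handler fields follow struct inet_diag_handler, and the
      three callbacks are the ones named above (bodies not shown):

        #include <linux/module.h>
        #include <linux/inet_diag.h>
        #include <linux/sctp.h>

        static const struct inet_diag_handler sctp_diag_handler = {
                .dump            = sctp_diag_dump,
                .dump_one        = sctp_diag_dump_one,
                .idiag_get_info  = sctp_diag_get_info,
                .idiag_type      = IPPROTO_SCTP,
                .idiag_info_size = sizeof(struct sctp_info),
        };

        static int __init sctp_diag_init(void)
        {
                return inet_diag_register(&sctp_diag_handler);
        }

        static void __exit sctp_diag_exit(void)
        {
                inet_diag_unregister(&sctp_diag_handler);
        }

        module_init(sctp_diag_init);
        module_exit(sctp_diag_exit);
        MODULE_LICENSE("GPL");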
      
      v2->v3:
      - fix the mistake in inet_assoc_attr_size().
      
      - change inet_diag_msg_laddrs_fill() name to inet_diag_msg_sctpladdrs_fill.
      
      - change inet_diag_msg_paddrs_fill() name to inet_diag_msg_sctpaddrs_fill.
      
      - add inet_diag_msg_sctpinfo_fill() to make asoc/ep fill code clearer.
      
      - add inet_diag_msg_sctpasoc_fill() to make asoc fill code clearer.
      
      - merge inet_asoc_diag_fill() and inet_ep_diag_fill() to
        inet_sctp_diag_fill().
      
      - call sctp_diag_get_info() directly instead of via the handler, since
        the caller is in the same file.

      - call lock_sock in sctp_tsp_dump_one() to make sure we get the sctp
        info safely.

      - after lock_sock(sk), we should check sk != assoc->base.sk.

      - change mem[SK_MEMINFO_WMEM_ALLOC] to asoc->sndbuf_used for the asoc dump
        when asoc->ep->sndbuf_policy is set; don't use the INET_DIAG_MEMINFO
        attr any more.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8f840e47
    • Xin Long's avatar
      sctp: export some functions for sctp_diag in inet_diag · cb2050a7
      Xin Long authored
      inet_diag_msg_common_fill is used to fill the common info of the diag
      msg; we need to use it in sctp_diag as well, so export it.

      inet_diag_msg_attrs_fill is used to fill some attrs info that is common
      to sctp diag and tcp diag.

      v2->v3:
      - no need to define and export inet_diag_get_handler any more, since
        all the functions it uses are in sctp_diag.ko; we just call them
        there directly.

      - add inet_diag_msg_attrs_fill to make the code clearer.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cb2050a7
    • Xin Long's avatar
      sctp: export some apis or variables for sctp_diag and reuse some for proc · 626d16f5
      Xin Long authored
      Some main variables in sctp.ko cannot be exported to other modules, so
      we have to define some APIs to access them.

      This includes sctp transport and endpoint traversal.

      There are some transport traversal functions for sctp_diag; we can also
      use them for sctp_proc, since it traverses transports in a similar way.

      v2->v3:
      - rhashtable_walk_init needs the gfp parameter, because of a recent
        upstream update
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      626d16f5
    • Xin Long's avatar
      sctp: add sctp_info dump api for sctp_diag · 52c52a61
      Xin Long authored
      sctp_diag dumps some important details of an sctp assoc or ep. We use
      sctp_info to describe them and sctp_get_sctp_info to get them, and
      export the latter to sctp_diag.ko.

      v2->v3:
      - do not use list_for_each_safe in sctp_get_sctp_info, since all its
        callers hold lock_sock.

      - fix the holes in struct sctp_info with __reserved* fields, because
        sctp_diag is a new feature and sctp_info is just for now; it may be
        changed in the future.
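
      A minimal sketch of the expected call pattern, assuming the exported
      signature sctp_get_sctp_info(sk, asoc, info):

        #include <net/sctp/sctp.h>

        static int fill_info_sketch(struct sock *sk,
                                    struct sctp_association *asoc,
                                    struct sctp_info *info)
        {
                int err;

                lock_sock(sk);          /* callers serialize via the sock lock */
                err = sctp_get_sctp_info(sk, asoc, info);
                release_sock(sk);
                return err;
        }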
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52c52a61
    • Marcelo Ricardo Leitner's avatar
      sctp: simplify sk_receive_queue locking · 311b2177
      Marcelo Ricardo Leitner authored
      SCTP already serializes access to rcvbuf through its sock lock:
      sctp_recvmsg takes it right at the start and releases it at the end,
      while the rx path also takes the lock before doing any socket
      processing. In sctp_rcv() it checks whether a user is using the socket
      and, if so, queues incoming packets to the backlog. The backlog
      processing does the same. Even timers do such a check and re-schedule
      if a user is using the socket.

      Simplifying this allows us to remove sctp_skb_list_tail and get rid
      of some expensive locking. The lists it is used on are also manipulated
      with functions like __skb_queue_tail and __skb_unlink in the same
      context, as in sctp_ulpq_tail_event() and sctp_clear_pd().
      sctp_close() also purges them while holding only the sock lock.

      Therefore the locking performed by sctp_skb_list_tail() is not
      necessary. This patch removes the function and replaces its calls with
      skb_queue_splice_tail_init() instead.
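
      A minimal sketch of the replacement; the surrounding sctp code is
      assumed to already hold the socket lock:

        #include <net/sock.h>

        /* before: sctp_skb_list_tail(queue, &sk->sk_receive_queue), which
         * took sk_receive_queue.lock with IRQs disabled for every splice */
        static void enqueue_events_sketch(struct sock *sk,
                                          struct sk_buff_head *queue)
        {
                skb_queue_splice_tail_init(queue, &sk->sk_receive_queue);
        }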
      
      The biggest gain is at sctp_ulpq_tail_event(), because the events always
      contain a list, even when queueing a single skb, and this was triggering
      expensive calls to spin_lock_irqsave/_irqrestore for every data chunk
      received.

      As SCTP delivers each data chunk on a corresponding recvmsg, the more
      chunks there are, the more effective the change is.
      Before this patch, with 30-byte chunks:
      netperf -t SCTP_STREAM -H 192.168.1.2 -cC -l 60 -- -m 30 -S 400000
      400000 -s 400000 400000
      on a 10Gbit link with 1500 MTU:
      
      SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      425984 425984     30    60.00       137.45   7.34     7.36     52.504  52.608
      
      With it:
      
      SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      425984 425984     30    60.00       179.10   7.97     6.70     43.740  36.788
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      311b2177
    • David S. Miller's avatar
      Merge branch 'mlx5_ifc-updates' · 936d4b41
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5_core: mlx5_ifc updates
      
      This series includes mlx5_core updates for both the net-next and rdma
      trees for the 4.7 kernel cycle. It is the only shared code planned
      for 4.7 between the rdma and net trees. Hopefully, this will prevent
      future conflicts when merging between ib-next and net-next once the
      4.7 cycle is over and the merge window opens.

      Both Mellanox rdma and net submissions will proceed once this series
      is applied to both trees.

      Future shared code will be sent to both maintainers as pull requests
      from Mellanox's kernel.org tree.

      We have included all the maintainers of the respective drivers.
      Kindly review the changes and let us know of any review comments.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      936d4b41
    • Saeed Mahameed's avatar
      net/mlx5: Update mlx5_ifc hardware features · 7d5e1423
      Saeed Mahameed authored
      Adding the needed mlx5_ifc hardware bits and structs
      for the following features:

      * Add vport to steering commands for SRIOV ACL support
      * Add mlcr, pcmr and mcia registers for dumping module EEPROM
      * Add support for FCS, beacon led and disable_link bits to
        hca caps
      * Add CQE period mode bit in CQ context for CQE based CQ
        moderation support
      * Add umr SQ bit for fragmented memory registration
      * Add needed bits and caps for Striding RQ support
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d5e1423
    • Tariq Toukan's avatar
      net/mlx5: Fix mlx5 ifc cmd_hca_cap bad offsets · e1c9c62b
      Tariq Toukan authored
      All reserved fields after early_vf_enable are off by 1, since
      early_vf_enable was not explicitly declared as an array of size 1.

      The reserved field before cqe_zip had a wrong size; it should
      be 0x80 + 0x3f.
      
      Fixes: b0844444 ("net/mlx5_core: Introduce access function to read internal timer ")
      Fixes: b4ff3a36 ("net/mlx5: Use offset based reserved field names in the IFC header file")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e1c9c62b
    • David S. Miller's avatar
      Merge branch 'qed-tunneling-offload' · 993feee9
      David S. Miller authored
      Manish Chopra says:
      
      ====================
      qed/qede: Add tunneling support
      
      This patch series adds support for VXLAN, GRE and GENEVE tunnels
      to be used over this driver. With this support, the adapter can perform
      TSO offload and inner/outer checksum offloads on TX and RX for
      encapsulated packets.

      V1->V2 [ Comments from Jesse Gross incorporated ]
      * Drop the general infrastructure change patch
        "net: Make vxlan/geneve default udp ports public"
      * Remove the default Linux UDP port configuration from the driver.
        Instead, use the general registration APIs for UDP port configuration
      * Remove .ndo_features_check - we will add it later with a proper change.
      
      Please consider applying this series to net-next.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      993feee9
    • Manish Chopra's avatar
      qede: Add fastpath support for tunneling · 14db81de
      Manish Chopra authored
      This patch enables netdev tunneling features and adds
      TX/RX fastpath support for tunneling in the driver.
      Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14db81de
    • Manish Chopra's avatar
      qed/qede: Add GENEVE tunnel slowpath configuration support · 9a109dd0
      Manish Chopra authored
      This patch enables the GENEVE tunnel on the adapter and
      adds support for driver hooks to configure UDP ports
      for GENEVE tunnel offload to be performed by the adapter.
      Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a109dd0
    • Manish Chopra's avatar
      qed/qede: Add VXLAN tunnel slowpath configuration support · b18e170c
      Manish Chopra authored
      This patch enables the VXLAN tunnel on the adapter and
      adds support for driver hooks to configure UDP ports
      for VXLAN tunnel offload to be performed by the adapter.
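
      A minimal sketch of the driver-hook shape in use at the time
      (.ndo_add_vxlan_port/.ndo_del_vxlan_port); the callback names and
      bodies here are illustrative, not the actual qede code:

        #include <linux/netdevice.h>

        static void sketch_add_vxlan_port(struct net_device *dev,
                                          sa_family_t sa_family, __be16 port)
        {
                /* a real callback would record ntohs(port) and trigger the
                 * slowpath configuration toward the adapter */
                netdev_info(dev, "VXLAN UDP port %u added\n", ntohs(port));
        }

        static void sketch_del_vxlan_port(struct net_device *dev,
                                          sa_family_t sa_family, __be16 port)
        {
                netdev_info(dev, "VXLAN UDP port %u removed\n", ntohs(port));
        }

        static const struct net_device_ops sketch_netdev_ops = {
                .ndo_add_vxlan_port = sketch_add_vxlan_port,
                .ndo_del_vxlan_port = sketch_del_vxlan_port,
        };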
      Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b18e170c
    • Manish Chopra's avatar
      qed: Add infrastructure support for tunneling · 464f6645
      Manish Chopra authored
      This patch adds the various structures/APIs needed to configure/enable
      different tunnel [VXLAN/GRE/GENEVE] parameters on the adapter.
      Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      464f6645
    • Peter Heise's avatar
      net/hsr: Added support for HSR v1 · ee1c2797
      Peter Heise authored
      This patch adds support for the newer version 1 of the HSR
      networking standard. Version 0 is still the default, and the new
      version has to be selected via iproute2.
      
      Main changes are in the supervision frame handling and its
      ethertype field.
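
      For reference, selecting the new version from user space might look
      roughly like this with a version-aware iproute2 (the exact option
      syntax is an assumption):

        ip link add name hsr0 type hsr slave1 eth0 slave2 eth1 version 1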
      Signed-off-by: Peter Heise <peter.heise@airbus.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee1c2797
    • David S. Miller's avatar
      Merge branch 'tcp-synflood-perf' · 125c8d12
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: final work on SYNFLOOD behavior
      
      In the first patch, I remove the costly association of SYNACK+COOKIES
      with a listener. I believe other parts of the stack should be ready.

      The second patch removes a useless write into the listener socket
      in tcp_rcv_state_process(), which incurred false sharing in
      tcp_conn_request().

      Performance under SYNFLOOD goes from 3.2 Mpps to 6 Mpps.

      The test used a single TCP listener on a host with 8 RX queues
      on the NIC and 24 cores (48 HT).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      125c8d12
    • Eric Dumazet's avatar
      tcp: remove false sharing in tcp_rcv_state_process() · 8804b272
      Eric Dumazet authored
      The last known hot spot during a SYNFLOOD attack is the clearing
      of rx_opt.saw_tstamp in tcp_rcv_state_process().

      It is not needed for a listener, so we move it to where it matters.

      Performance while a SYNFLOOD hits a single listener socket went
      from 5 Mpps to 6 Mpps on my test server (24 cores, 8 NIC RX queues).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8804b272
    • Eric Dumazet's avatar
      tcp: do not mess with listener sk_wmem_alloc · b3d05147
      Eric Dumazet authored
      When removing sk_refcnt manipulation on synflood, I missed that
      using skb_set_owner_w() was racy if sk->sk_wmem_alloc had already
      transitioned to 0.

      We should hold sk_refcnt instead, but this is a big deal under attack.
      (Doing so only increases performance from 3.2 Mpps to 3.8 Mpps.)

      In this patch, I chose not to attach a socket to syncookie skbs.

      Performance is now 5 Mpps instead of 3.2 Mpps.

      The following patch will remove the last known false sharing in
      tcp_rcv_state_process().
      
      Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b3d05147
    • Amitoj Kaur Chawla's avatar
      qlge: Replace create_singlethread_workqueue with alloc_ordered_workqueue · ac18dd9e
      Amitoj Kaur Chawla authored
      Replace the deprecated create_singlethread_workqueue with
      alloc_ordered_workqueue.

      Work items include getting tx/rx frame sizes, resetting the MPI
      processor and setting the asic recovery bit, so ordering seems
      necessary: only one work item should be queued/executing at any given
      time, hence the use of alloc_ordered_workqueue.

      The WQ_MEM_RECLAIM flag has been set since ethernet devices seem to sit
      in the memory reclaim path, to guarantee forward progress regardless of
      memory pressure.
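
      A minimal sketch of the conversion (qdev/ndev follow the qlge driver's
      naming; the private header and format string are assumptions):

        #include <linux/netdevice.h>
        #include <linux/workqueue.h>
        #include "qlge.h"               /* assumed: provides struct ql_adapter */

        static int alloc_wq_sketch(struct ql_adapter *qdev,
                                   struct net_device *ndev)
        {
                /* before: create_singlethread_workqueue(ndev->name) */
                qdev->workqueue = alloc_ordered_workqueue("%s", WQ_MEM_RECLAIM,
                                                          ndev->name);
                return qdev->workqueue ? 0 : -ENOMEM;
        }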
      Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac18dd9e
    • David S. Miller's avatar
      Merge branch 'tipc-link-setup-improvements' · 0818556c
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: improvements to the link setup algorithm
      
      This series addresses some smaller issues regarding the link setup
      algorithm. The first commit fixes a rare bug we have discovered during
      testing; the second one may have some future impact on cluster
      scalability, while the remaining ones can be regarded as cosmetic in
      a wider sense of the word.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0818556c
    • Jon Paul Maloy's avatar
      tipc: let first message on link be a state message · 34b9cd64
      Jon Paul Maloy authored
      According to the link FSM, a received traffic packet can take a link
      from state ESTABLISHING to ESTABLISHED, but the link still cannot be
      fully set up in one atomic operation. This means that even if the
      very first packet on the link is a traffic packet with sequence number
      1 (one), it has to be dropped and retransmitted.

      This can be avoided if we let the mentioned packet be preceded by a
      LINK_PROTOCOL/STATE message, which brings up the endpoint before the
      traffic arrives.
      
      We add this small feature in this commit.
      
      This is a fully compatible change.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34b9cd64
    • Jon Paul Maloy's avatar
      tipc: ensure that first packets on link are sent in order · de7e07f9
      Jon Paul Maloy authored
      In some link establishment scenarios we see that packet #2 may be sent
      out before packet #1, forcing the receiver to demand retransmission of
      the missing packet. This is harmless, but may cause confusion among
      people tracing the packet flow.

      Since this is extremely easy to fix, we do so by adding an extra send
      call to the bearer immediately after the link has come up.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de7e07f9
    • Jon Paul Maloy's avatar
      tipc: refactor function tipc_link_timeout() · 42b18f60
      Jon Paul Maloy authored
      The function tipc_link_timeout() is unnecessarily complex, and can
      easily be made more readable.

      We do that with this commit. The only functional change is that we
      remove a redundant test for whether the broadcast link is up or not.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      42b18f60
    • Jon Paul Maloy's avatar
      tipc: reduce transmission rate of reset messages when link is down · 88e8ac70
      Jon Paul Maloy authored
      When a link is down, it will continuously try to re-establish contact
      with the peer by sending out a RESET or an ACTIVATE message at each
      timeout interval. The default value for this interval is currently
      375 ms. This is wasteful, and may become a problem in very large
      clusters with dozens or hundreds of nodes being down simultaneously.
      
      We now introduce a simple backoff algorithm for these cases. The
      first five messages are sent at the default rate; thereafter a message
      is sent only every 16th timer interval.
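
      A minimal sketch of that backoff, with a made-up counter standing in
      for the link's actual state:

        #include <linux/types.h>

        struct reset_backoff_sketch {
                unsigned int silent_intervals;  /* timeouts since the link went down */
        };

        static bool should_send_reset(struct reset_backoff_sketch *b)
        {
                b->silent_intervals++;
                if (b->silent_intervals <= 5)           /* first five at default rate */
                        return true;
                return !(b->silent_intervals % 16);     /* then only every 16th interval */
        }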
      
      This will cover the vast majority of link recycling cases, since the
      endpoint starting last will transmit at the higher rate, and the link
      should normally be established well before the rate needs to be
      reduced.
      
      The only case where we will see a degradation of link re-establishment
      times is when the endpoints remain intact, and a glitch in the
      transmission media is causing the link reset. We will then experience
      a worst-case re-establishing time of 6 seconds, something we deem
      acceptable.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88e8ac70
    • Jon Paul Maloy's avatar
      tipc: guarantee peer bearer id exchange after reboot · 634696b1
      Jon Paul Maloy authored
      When a link endpoint is going down locally, e.g., because its interface
      is being stopped, it will spontaneously send out a RESET message to
      its peer, informing it about this fact. This saves the peer from
      detecting the failure via probing, and hence gives both speedier and
      less resource consuming failure detection on the peer side.
      
      According to the link FSM, a receiver of a RESET message, ignoring the
      reason for it, must now consider the sender ready to come back up, and
      starts periodically sending out ACTIVATE messages to the peer in order
      to re-establish the link. Also, according to the FSM, the receiver of
      an ACTIVATE message can now go directly to state ESTABLISHED and start
      sending regular traffic packets. This is a well-proven and robust FSM.
      
      However, in the case of a reboot, there is a small possibility that the
      link endpoint on the rebooted node may have been re-created with a new
      bearer identity between the moment it sent its (pre-boot) RESET and the
      moment it receives the ACTIVATE from the peer. In this scenario the new
      bearer identity cannot be known by the peer, since traffic headers
      don't convey such information. This is a problem, because both endpoints
      need to know the correct value of the peer's bearer id at any moment in
      time in order to be able to produce correct link events for their users.

      The only way to guarantee this is to enforce a full setup message
      exchange (RESET + ACTIVATE) even after the reboot, since those messages
      carry the bearer identity in their header.
      
      In this commit we do this by introducing and setting a "stopping" bit in
      the header of the spontaneously generated RESET messages, informing the
      peer that the sender will not be immediately ready to re-establish the
      link. A receiver seeing this bit must act as if this were a locally
      detected connectivity failure, and hence has to go through a full two-
      way setup message exchange before any link can be re-established.
      
      Although never reported, this problem seems to have always been around.
      
      This protocol addition is fully backwards compatible.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      634696b1
    • David S. Miller's avatar
      Merge branch 'mlxsw-next' · 25fb0b6c
      David S. Miller authored
      Jiri Pirko says:
      
      ====================
      mlxsw: spectrum_buffers: couple of cosmetic patches
      
      As suggested by David Laight
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      25fb0b6c
    • Jiri Pirko's avatar
      devlink: fix sb register stub in case devlink is disabled · de33efd0
      Jiri Pirko authored
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Fixes: bf797471 ("devlink: add shared buffer configuration")
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de33efd0
    • Paolo Abeni's avatar
      tun: use per cpu variables for stats accounting · 608b9977
      Paolo Abeni authored
      Currently the tun device accounting uses dev->stats without applying any
      kind of protection, even though accounting happens in preemptible
      process context.
      This patch moves the tun stats to a per-cpu data structure, and protects
      the updates with u64_stats_update_begin()/u64_stats_update_end() or
      this_cpu_inc according to the stat type. The per-cpu stats are
      aggregated by the newly added ndo_get_stats64 op.
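
      A minimal sketch of the per-CPU accounting pattern (the struct layout
      here is illustrative, not the exact tun one):

        #include <linux/netdevice.h>
        #include <linux/u64_stats_sync.h>

        struct pcpu_stats_sketch {
                u64                     rx_packets;
                u64                     rx_bytes;
                struct u64_stats_sync   syncp;
                u32                     rx_dropped;     /* bumped with this_cpu_inc-style ops */
        };

        static void account_rx_sketch(struct pcpu_stats_sketch __percpu *stats,
                                      unsigned int len)
        {
                struct pcpu_stats_sketch *s = this_cpu_ptr(stats);

                u64_stats_update_begin(&s->syncp);
                s->rx_packets++;
                s->rx_bytes += len;
                u64_stats_update_end(&s->syncp);
        }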
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      608b9977
    • David S. Miller's avatar
      Merge branch 'bpf-ARG_PTR_TO_RAW_STACK' · 548aacdd
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      BPF updates
      
      This series adds a new verifier argument type called
      ARG_PTR_TO_RAW_STACK and converts related helpers to make
      use of it. The basic idea is that we can save the init of stack
      memory when the helper function is guaranteed to fully
      fill out the passed buffer in every path. The series also adds
      test cases and converts samples. For more details, please
      see the individual patches.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      548aacdd
    • Daniel Borkmann's avatar
      bpf, samples: add test cases for raw stack · 3f2050e2
      Daniel Borkmann authored
      This adds test cases mostly around ARG_PTR_TO_RAW_STACK to check the
      verifier behaviour.
      
        [...]
        #84 raw_stack: no skb_load_bytes OK
        #85 raw_stack: skb_load_bytes, no init OK
        #86 raw_stack: skb_load_bytes, init OK
        #87 raw_stack: skb_load_bytes, spilled regs around bounds OK
        #88 raw_stack: skb_load_bytes, spilled regs corruption OK
        #89 raw_stack: skb_load_bytes, spilled regs corruption 2 OK
        #90 raw_stack: skb_load_bytes, spilled regs + data OK
        #91 raw_stack: skb_load_bytes, invalid access 1 OK
        #92 raw_stack: skb_load_bytes, invalid access 2 OK
        #93 raw_stack: skb_load_bytes, invalid access 3 OK
        #94 raw_stack: skb_load_bytes, invalid access 4 OK
        #95 raw_stack: skb_load_bytes, invalid access 5 OK
        #96 raw_stack: skb_load_bytes, invalid access 6 OK
        #97 raw_stack: skb_load_bytes, large access OK
        Summary: 98 PASSED, 0 FAILED
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3f2050e2
    • Daniel Borkmann's avatar
      bpf, samples: don't zero data when not needed · 02413cab
      Daniel Borkmann authored
      Remove the zero initialization in the sample programs where appropriate.
      Note that this is an optimization which is now possible; old programs
      still doing the zero initialization are just fine as well. Also, make
      sure we don't have padding issues when we don't memset() the entire
      struct anymore.
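
      A minimal sketch in samples/bpf style (the section name, kprobe target
      and helper header are taken from the samples and are assumptions here):

        #include <linux/ptrace.h>
        #include <uapi/linux/bpf.h>
        #include "bpf_helpers.h"

        SEC("kprobe/sys_write")
        int prog_sketch(struct pt_regs *ctx)
        {
                char comm[16];

                /* no memset(&comm, 0, sizeof(comm)) needed any more: the helper
                 * is guaranteed to initialize the buffer on every path */
                bpf_get_current_comm(&comm, sizeof(comm));
                return 0;
        }

        char _license[] SEC("license") = "GPL";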
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02413cab
    • Daniel Borkmann's avatar
      bpf: convert relevant helper args to ARG_PTR_TO_RAW_STACK · 074f528e
      Daniel Borkmann authored
      This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
      type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
      bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
      and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
      also be removed since the verifier already makes sure we stay within bounds
      on stack buffers.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      074f528e
    • Daniel Borkmann's avatar
      bpf, verifier: add ARG_PTR_TO_RAW_STACK type · 435faee1
      Daniel Borkmann authored
      When passing buffers from eBPF stack space into a helper function, we have
      ARG_PTR_TO_STACK argument type for helpers available. The verifier makes sure
      that such buffers are initialized, within boundaries, etc.
      
      However, the downside with this is that we have a couple of helper functions
      such as bpf_skb_load_bytes() that fill out the passed buffer in the expected
      success case anyway, so zero initializing them prior to the helper call is
      unneeded/wasted instructions in the eBPF program that can be avoided.
      
      Therefore, add a new helper function argument type called ARG_PTR_TO_RAW_STACK.
      The idea is to skip the STACK_MISC check in check_stack_boundary() and color
      the related stack slots as STACK_MISC after we checked all call arguments.
      
      Helper functions using ARG_PTR_TO_RAW_STACK must make sure that every path of
      the helper function will fill the provided buffer area, so that we cannot leak
      any uninitialized stack memory. This f.e. means that error paths need to
      memset() the buffers, but the expected fast-path doesn't have to do this
      anymore.
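
      A minimal sketch of the contract this implies for a helper taking such
      a buffer (the helper name and data source are placeholders):

        #include <linux/string.h>
        #include <linux/errno.h>

        /* dst/size come from an ARG_PTR_TO_RAW_STACK argument pair */
        static int fill_raw_stack_buf_sketch(void *dst, unsigned int size,
                                             const void *src, unsigned int avail)
        {
                if (size > avail) {
                        /* error path must still leave dst fully initialized */
                        memset(dst, 0, size);
                        return -EFAULT;
                }
                memcpy(dst, src, size);         /* fast path fills the whole buffer */
                return 0;
        }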
      
      Since there's no such helper needing more than at most one ARG_PTR_TO_RAW_STACK
      argument, we can keep it simple and don't need to check for multiple areas.
      Should in future such a use-case really appear, we have check_raw_mode() that
      will make sure we implement support for it first.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      435faee1