Commits · e3e17b773bfe45462b7f3fae20c550025975cb13 · nexedi / linux

06 Feb, 2016 33 commits

tcp: fastopen: call tcp_fin() if FIN present in SYNACK · e3e17b77

Eric Dumazet authored Feb 06, 2016

When we acknowledge a FIN, it is not enough to ack the sequence number
and queue the skb into receive queue. We also have to call tcp_fin()
to properly update socket state and send proper poll() notifications.

It seems we also had the problem if we received a SYN packet with the
FIN flag set, but it does not seem an urgent issue, as no known
implementation can do that.

Fixes: 61d2bcae ("tcp: fastopen: accept data/FIN present in SYNACK message")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e3e17b77

Merge branch 'tipc-topology-updates' · 9a23ac47

David S. Miller authored Feb 06, 2016

Parthasarathy Bhuvaragan says:

====================
tipc: cleanups, fixes & improvements for topology server

This series contains topology server cleanups, fixes and improvements.

Cleanups in #1-#4:
We remove duplicate data structures and aligin the rest of the code accordingly.

Fixes in #5-#8:
The bugs occur either during configuration or while running on SMP targets,
which are race conditions that pop up under different situations.

Improvements in #9-#10:
Updates to decrease timer usage and improve readability.

v2: Updated commit message in patch 6 based on feedback from
    Sergei Shtylyov sergei.shtylyov@cogentembedded.com
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9a23ac47

tipc: use alloc_ordered_workqueue() instead of WQ_UNBOUND w/ max_active = 1 · 06c8581f

Parthasarathy Bhuvaragan authored Feb 02, 2016

Until now, tipc_rcv and tipc_send workqueues in server are allocated
with parameters WQ_UNBOUND & max_active = 1.
This parameters passed to this function makes it equivalent to
alloc_ordered_workqueue(). The later form is more explicit and
can inherit future ordered_workqueue changes.

In this commit we replace alloc_workqueue() with more readable
alloc_ordered_workqueue().
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

06c8581f

tipc: donot create timers if subscription timeout = TIPC_WAIT_FOREVER · ae245557

Parthasarathy Bhuvaragan authored Feb 02, 2016

Until now, we create timers even for the subscription requests
with timeout = TIPC_WAIT_FOREVER.
This can be improved by avoiding timer creation when the timeout
is set to TIPC_WAIT_FOREVER.

In this commit, we introduce a check to creates timers only
when timeout != TIPC_WAIT_FOREVER.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ae245557

tipc: protect tipc_subscrb_get() with subscriber spin lock · f3ad288c

Parthasarathy Bhuvaragan authored Feb 02, 2016

Until now, during subscription creation the mod_time() &
tipc_subscrb_get() are called after releasing the subscriber
spin lock.

In a SMP system when performing a subscription creation, if the
subscription timeout occurs simultaneously (the timer is
scheduled to run on another CPU) then the timer thread
might decrement the subscribers refcount before the create
thread increments the refcount.

This can be simulated by creating subscription with timeout=0 and
sometimes the timeout occurs before the create request is complete.
This leads to the following message:
[30.702949] BUG: spinlock bad magic on CPU#1, kworker/u8:3/87
[30.703834] general protection fault: 0000 [#1] SMP
[30.704826] CPU: 1 PID: 87 Comm: kworker/u8:3 Not tainted 4.4.0-rc8+ #18
[30.704826] Workqueue: tipc_rcv tipc_recv_work [tipc]
[30.704826] task: ffff88003f878600 ti: ffff88003fae0000 task.ti: ffff88003fae0000
[30.704826] RIP: 0010:[<ffffffff8109196c>]  [<ffffffff8109196c>] spin_dump+0x5c/0xe0
[...]
[30.704826] Call Trace:
[30.704826]  [<ffffffff81091a16>] spin_bug+0x26/0x30
[30.704826]  [<ffffffff81091b75>] do_raw_spin_lock+0xe5/0x120
[30.704826]  [<ffffffff81684439>] _raw_spin_lock_bh+0x19/0x20
[30.704826]  [<ffffffffa0096f10>] tipc_subscrb_rcv_cb+0x1d0/0x330 [tipc]
[30.704826]  [<ffffffffa00a37b1>] tipc_receive_from_sock+0xc1/0x150 [tipc]
[30.704826]  [<ffffffffa00a31df>] tipc_recv_work+0x3f/0x80 [tipc]
[30.704826]  [<ffffffff8106a739>] process_one_work+0x149/0x3c0
[30.704826]  [<ffffffff8106aa16>] worker_thread+0x66/0x460
[30.704826]  [<ffffffff8106a9b0>] ? process_one_work+0x3c0/0x3c0
[30.704826]  [<ffffffff8106a9b0>] ? process_one_work+0x3c0/0x3c0
[30.704826]  [<ffffffff8107029d>] kthread+0xed/0x110
[30.704826]  [<ffffffff810701b0>] ? kthread_create_on_node+0x190/0x190
[30.704826]  [<ffffffff81684bdf>] ret_from_fork+0x3f/0x70

In this commit,
1. we remove the check for the return code for mod_timer()
2. we protect tipc_subscrb_get() using the subscriber spin lock.
   We increment the subscriber's refcount as soon as we add the
   subscription to subscriber's subscription list.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f3ad288c

tipc: hold subscriber->lock for tipc_nametbl_subscribe() · d4091899

Parthasarathy Bhuvaragan authored Feb 02, 2016

Until now, while creating a subscription the subscriber lock
protects only the subscribers subscription list and not the
nametable. The call to tipc_nametbl_subscribe() is outside
the lock. However, at subscription timeout and cancel both
the subscribers subscription list and the nametable are
protected by the subscriber lock.

This asymmetric locking mechanism leads to the following problem:
In a SMP system, the timer can be fire on another core before
the create request is complete.
When the timer thread calls tipc_nametbl_unsubscribe() before create
thread calls tipc_nametbl_subscribe(), we get a nullptr exception.

This can be simulated by creating subscription with timeout=0 and
sometimes the timeout occurs before the create request is complete.

The following is the oops:
[57.569661] BUG: unable to handle kernel NULL pointer dereference at (null)
[57.577498] IP: [<ffffffffa02135aa>] tipc_nametbl_unsubscribe+0x8a/0x120 [tipc]
[57.584820] PGD 0
[57.586834] Oops: 0002 [#1] SMP
[57.685506] CPU: 14 PID: 10077 Comm: kworker/u40:1 Tainted: P OENX 3.12.48-52.27.1.     9688.1.PTF-default #1
[57.703637] Workqueue: tipc_rcv tipc_recv_work [tipc]
[57.708697] task: ffff88064c7f00c0 ti: ffff880629ef4000 task.ti: ffff880629ef4000
[57.716181] RIP: 0010:[<ffffffffa02135aa>]  [<ffffffffa02135aa>] tipc_nametbl_unsubscribe+0x8a/   0x120 [tipc]
[...]
[57.812327] Call Trace:
[57.814806]  [<ffffffffa0211c77>] tipc_subscrp_delete+0x37/0x90 [tipc]
[57.821357]  [<ffffffffa0211e2f>] tipc_subscrp_timeout+0x3f/0x70 [tipc]
[57.827982]  [<ffffffff810618c1>] call_timer_fn+0x31/0x100
[57.833490]  [<ffffffff81062709>] run_timer_softirq+0x1f9/0x2b0
[57.839414]  [<ffffffff8105a795>] __do_softirq+0xe5/0x230
[57.844827]  [<ffffffff81520d1c>] call_softirq+0x1c/0x30
[57.850150]  [<ffffffff81004665>] do_softirq+0x55/0x90
[57.855285]  [<ffffffff8105aa35>] irq_exit+0x95/0xa0
[57.860290]  [<ffffffff815215b5>] smp_apic_timer_interrupt+0x45/0x60
[57.866644]  [<ffffffff8152005d>] apic_timer_interrupt+0x6d/0x80
[57.872686]  [<ffffffffa02121c5>] tipc_subscrb_rcv_cb+0x2a5/0x3f0 [tipc]
[57.879425]  [<ffffffffa021c65f>] tipc_receive_from_sock+0x9f/0x100 [tipc]
[57.886324]  [<ffffffffa021c826>] tipc_recv_work+0x26/0x60 [tipc]
[57.892463]  [<ffffffff8106fb22>] process_one_work+0x172/0x420
[57.898309]  [<ffffffff8107079a>] worker_thread+0x11a/0x3c0
[57.903871]  [<ffffffff81077114>] kthread+0xb4/0xc0
[57.908751]  [<ffffffff8151f318>] ret_from_fork+0x58/0x90

In this commit, we do the following at subscription creation:
1. set the subscription's subscriber pointer before performing
   tipc_nametbl_subscribe(), as this value is required further in
   the call chain ex: by tipc_subscrp_send_event().
2. move tipc_nametbl_subscribe() under the scope of subscriber lock
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d4091899

tipc: fix connection abort when receiving invalid cancel request · cb01c7c8

Parthasarathy Bhuvaragan authored Feb 02, 2016

Until now, the subscribers endianness for a subscription
create/cancel request is determined as:
    swap = !(s->filter & (TIPC_SUB_PORTS | TIPC_SUB_SERVICE))
The checks are performed only for port/service subscriptions.

The swap calculation is incorrect if the filter in the subscription
cancellation request is set to TIPC_SUB_CANCEL (it's a malformed
cancel request, as the corresponding subscription create filter
is missing).
Thus, the check if the request is for cancellation fails and the
request is treated as a subscription create request. The
subscription creation fails as the request is illegal, which
terminates this connection.

In this commit we determine the endianness by including
TIPC_SUB_CANCEL, which will set swap correctly and the
request is processed as a cancellation request.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cb01c7c8

tipc: fix connection abort during subscription cancellation · c8beccc6

Parthasarathy Bhuvaragan authored Feb 02, 2016

In 'commit 7fe8097c ("tipc: fix nullpointer bug when subscribing
to events")', we terminate the connection if the subscription
creation fails.
In the same commit, the subscription creation result was based on
the value of subscription pointer (set in the function) instead of
the return code.

Unfortunately, the same function also handles subscription
cancellation request. For a subscription cancellation request,
the subscription pointer cannot be set. Thus the connection is
terminated during cancellation request.

In this commit, we move the subcription cancel check outside
of tipc_subscrp_create(). Hence,
- tipc_subscrp_create() will create a subscripton
- tipc_subscrb_rcv_cb() will subscribe or cancel a subscription.

Fixes: 'commit 7fe8097c ("tipc: fix nullpointer bug when subscribing to events")'
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c8beccc6

tipc: introduce tipc_subscrb_subscribe() routine · 7c13c622

Parthasarathy Bhuvaragan authored Feb 02, 2016

In this commit, we split tipc_subscrp_create() into two:
1. tipc_subscrp_create() creates a subscription
2. A new function tipc_subscrp_subscribe() adds the
   subscription to the subscriber subscription list,
   activates the subscription timer and subscribes to
   the nametable updates.

In future commits, the purpose of tipc_subscrb_rcv_cb() will
be to either subscribe or cancel a subscription.

There is no functional change in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7c13c622

tipc: remove struct tipc_name_seq from struct tipc_subscription · a4273c73

Parthasarathy Bhuvaragan authored Feb 02, 2016

Until now, struct tipc_subscriber has duplicate fields for
type, upper and lower (as member of struct tipc_name_seq) at:
1. as member seq in struct tipc_subscription
2. as member seq in struct tipc_subscr, which is contained
   in struct tipc_event
The former structure contains the type, upper and lower
values in network byte order and the later contains the
intact copy of the request.
The struct tipc_subscription contains a field swap to
determine if request needs network byte order conversion.
Thus by using swap, we can convert the request when
required instead of duplicating it.

In this commit,
1. we remove the references to these elements as members of
   struct tipc_subscription and replace them with elements
   from struct tipc_subscr.
2. provide new functions to convert the user request into
   network byte order.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a4273c73

tipc: remove filter and timeout elements from struct tipc_subscription · 30865231

Parthasarathy Bhuvaragan authored Feb 02, 2016

Until now, struct tipc_subscription has duplicate timeout and filter
attributes present:
1. directly as members of struct tipc_subscription
2. in struct tipc_subscr, which is contained in struct tipc_event

In this commit, we remove the references to these elements as
members of struct tipc_subscription and replace them with elements
from struct tipc_subscr.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

30865231

tipc: remove incorrect check for subscription timeout value · 4f61d4ef

Parthasarathy Bhuvaragan authored Feb 02, 2016

Until now, during subscription creation we set sub->timeout by
converting the timeout request value in milliseconds to jiffies.
This is followed by setting the timeout value in the timer if
sub->timeout != TIPC_WAIT_FOREVER.

For a subscription create request with a timeout value of
TIPC_WAIT_FOREVER, msecs_to_jiffies(TIPC_WAIT_FOREVER)
returns MAX_JIFFY_OFFSET (0xfffffffe). This is not equal to
TIPC_WAIT_FOREVER (0xffffffff).

In this commit, we remove this check.
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4f61d4ef

bonding: add slave device name for debug · c6140a29

Zhang Shengju authored Feb 02, 2016

netdev_dbg() will add bond device name, it will be helpful if we print
slave device name.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c6140a29

bgmac: add helper checking for BCM4707 / BCM53018 chip id · 387b75f8

Rafał Miłecki authored Feb 02, 2016

Chipsets with BCM4707 / BCM53018 ID require special handling at a few
places in the code. It's likely there will be more IDs to check in the
future. To simplify it add this trivial helper.
Signed-off-by: Rafał Miłecki <zajec5@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

387b75f8

Merge branch 'bpf-per-cpu-maps' · 8ac2c867

David S. Miller authored Feb 06, 2016

Alexei Starovoitov says:

====================
bpf: introduce per-cpu maps

We've started to use bpf to trace every packet and atomic add
instruction (event JITed) started to show up in perf profile.
The solution is to do per-cpu counters.
For PERCPU_(HASH|ARRAY) map the existing bpf_map_lookup() helper
returns per-cpu area which bpf programs can use to store and
increment the counters. The BPF_MAP_LOOKUP_ELEM syscall command
returns areas from all cpus and user process aggregates the counters.
The usage example is in patch 6. The api turned out to be very
easy to use from bpf program and from user space.
Long term we were discussing to add 'bounded loop' instruction,
so bpf programs can do aggregation within the program which may
help some use cases. Right now user space aggregation of
per-cpu counters fits the best.

This patch set is new approach for per-cpu hash and array maps.
I've reused the map tests written by Martin and Ming, but
implementation and api is new. Old discussion here:
http://thread.gmane.org/gmane.linux.kernel/2123800/focus=2126435
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8ac2c867

samples/bpf: update tracex[23] examples to use per-cpu maps · 3059303f
Alexei Starovoitov authored Feb 01, 2016
```
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
```
3059303f

samples/bpf: unit test for BPF_MAP_TYPE_PERCPU_ARRAY · df570f57

tom.leiming@gmail.com authored Feb 01, 2016

A sanity test for BPF_MAP_TYPE_PERCPU_ARRAY
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

df570f57

samples/bpf: unit test for BPF_MAP_TYPE_PERCPU_HASH · e1559671

Martin KaFai Lau authored Feb 01, 2016

A sanity test for BPF_MAP_TYPE_PERCPU_HASH.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1559671

bpf: add lookup/update support for per-cpu hash and array maps · 15a07b33

Alexei Starovoitov authored Feb 01, 2016

The functions bpf_map_lookup_elem(map, key, value) and
bpf_map_update_elem(map, key, value, flags) need to get/set
values from all-cpus for per-cpu hash and array maps,
so that user space can aggregate/update them as necessary.

Example of single counter aggregation in user space:
  unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
  long values[nr_cpus];
  long value = 0;

  bpf_lookup_elem(fd, key, values);
  for (i = 0; i < nr_cpus; i++)
    value += values[i];

The user space must provide round_up(value_size, 8) * nr_cpus
array to get/set values, since kernel will use 'long' copy
of per-cpu values to try to copy good counters atomically.
It's a best-effort, since bpf programs and user space are racing
to access the same memory.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

15a07b33

bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map · a10423b8

Alexei Starovoitov authored Feb 01, 2016

Primary use case is a histogram array of latency
where bpf program computes the latency of block requests or other
events and stores histogram of latency into array of 64 elements.
All cpus are constantly running, so normal increment is not accurate,
bpf_xadd causes cache ping-pong and this per-cpu approach allows
fastest collision-free counters.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a10423b8

bpf: introduce BPF_MAP_TYPE_PERCPU_HASH map · 824bd0ce

Alexei Starovoitov authored Feb 01, 2016

Introduce BPF_MAP_TYPE_PERCPU_HASH map type which is used to do
accurate counters without need to use BPF_XADD instruction which turned
out to be too costly for high-performance network monitoring.
In the typical use case the 'key' is the flow tuple or other long
living object that sees a lot of events per second.

bpf_map_lookup_elem() returns per-cpu area.
Example:
struct {
  u32 packets;
  u32 bytes;
} * ptr = bpf_map_lookup_elem(&map, &key);
/* ptr points to this_cpu area of the value, so the following
 * increments will not collide with other cpus
 */
ptr->packets ++;
ptr->bytes += skb->len;

bpf_update_elem() atomically creates a new element where all per-cpu
values are zero initialized and this_cpu value is populated with
given 'value'.
Note that non-per-cpu hash map always allocates new element
and then deletes old after rcu grace period to maintain atomicity
of update. Per-cpu hash map updates element values in-place.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

824bd0ce

ethtool: Declare netdev_rss_key as __read_mostly. · ba905f5e

Kim Jones authored Feb 02, 2016

netdev_rss_key is written to once and thereafter is read by
drivers when they are initialising. The fact that it is mostly
read and not written to makes it a candidate for a __read_mostly
declaration.
Signed-off-by: Kim Jones <kim-marie.jones@intel.com>
Signed-off-by: Alan Carey <alan.carey@intel.com>
Acked-by: Rami Rosen <rami.rosen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ba905f5e

Merge branch 'tcp_fast_open_synack_fin' · ef449678

David S. Miller authored Feb 06, 2016

Eric Dumazet says:

====================
tcp: fastopen: accept data/FIN present in SYNACK

Implements RFC 7413 (TCP Fast Open) 4.2.2, accepting payload and/or FIN
in SYNACK messages, and prepare removal of SYN flag in tcp_recvmsg()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ef449678

tcp: do not enqueue skb with SYN flag · 9d691539

Eric Dumazet authored Feb 01, 2016

If we remove the SYN flag from the skbs that tcp_fastopen_add_skb()
places in socket receive queue, then we can remove the test that
tcp_recvmsg() has to perform in fast path.

All we have to do is to adjust SEQ in the slow path.

For the moment, we place an unlikely() and output a message
if we find an skb having SYN flag set.
Goal would be to get rid of the test completely.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9d691539

tcp: fastopen: accept data/FIN present in SYNACK message · 61d2bcae

Eric Dumazet authored Feb 01, 2016

RFC 7413 (TCP Fast Open) 4.2.2 states that the SYNACK message
MAY include data and/or FIN

This patch adds support for the client side :

If we receive a SYNACK with payload or FIN, queue the skb instead
of ignoring it.

Since we already support the same for SYN, we refactor the existing
code and reuse it. Note we need to clone the skb, so this operation
might fail under memory pressure.

Sara Dickinson pointed out FreeBSD server Fast Open implementation
was planned to generate such SYNACK in the future.

The server side might be implemented on linux later.
Reported-by: Sara Dickinson <sara@sinodun.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

61d2bcae

Merge branch 'rx_nohandler' · df03288b

David S. Miller authored Feb 06, 2016

Jarod Wilson says:

====================
net: add and use rx_nohandler stat counter

The network core tries to keep track of dropped packets, but some packets
you wouldn't really call dropped, so much as intentionally ignored, under
certain circumstances. One such case is that of bonding and team device
slaves that are currently inactive. Their respective rx_handler functions
return RX_HANDLER_EXACT (the only places in the kernel that return that),
which ends up tracking into the network core's __netif_receive_skb_core()
function's drop path, with no pt_prev set. On a noisy network, this can
result in a very rapidly incrementing rx_dropped counter, not only on the
inactive slave(s), but also on the master device, such as the following:

$ cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  p7p1: 14783346  140430    0 140428    0     0          0      2040      680       8    0    0    0     0       0          0
  p7p2: 14805198  140648    0    0    0     0          0      2034        0       0    0    0    0     0       0          0
 bond0: 53365248  532798    0 421160    0     0          0    115151     2040      24    0    0    0     0       0          0
    lo:    5420      54    0    0    0     0          0         0     5420      54    0    0    0     0       0          0
  p5p1: 19292195  196197    0 140368    0     0          0     56564      680       8    0    0    0     0       0          0
  p5p2: 19289707  196171    0 140364    0     0          0     56547      680       8    0    0    0     0       0          0
   em3: 20996626  158214    0    0    0     0          0       383        0       0    0    0    0     0       0          0
   em2: 14065122  138462    0    0    0     0          0       310        0       0    0    0    0     0       0          0
   em1: 14063162  138440    0    0    0     0          0       308        0       0    0    0    0     0       0          0
   em4: 21050830  158729    0    0    0     0          0       385    71662     469    0    0    0     0       0          0
   ib0:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0

In this scenario, p5p1, p5p2 and p7p1 are all inactive slaves in an
active-backup bond0, and you can see that all three have high drop counts,
with the master bond0 showing a tally of all three.

I know that this was previously discussed some here:

    http://www.spinics.net/lists/netdev/msg226341.html

It seems additional counters never came to fruition, so this is a first
attempt at creating one of them, so that we stop calling these drops,
which for users monitoring rx_dropped, causes great alarm, and renders the
counter much less useful for them.

This adds a sysfs statistics node and makes the counter available via
netlink.

Additionally, I'm not certain if this set qualifies for net, or if it
should be put aside and resubmitted for net-next after 4.5 is put to
bed, but I do have users who consider this an important bugfix.

This has been tested quite a bit on x86_64, and now lightly on i686 as
well, to verify functionality of updates to netdev_stats_to_stats64()
on 32-bit arches.
====================
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

df03288b

bond: track sum of rx_nohandler for all slaves · f344b0d9

Jarod Wilson authored Feb 01, 2016

Sample output with this set applied for an active-backup bond:

$ cat /sys/devices/virtual/net/bond0/lower_p7p1/statistics/rx_nohandler
16568
$ cat /sys/devices/virtual/net/bond0/lower_p5p2/statistics/rx_nohandler
16583
$ cat /sys/devices/virtual/net/bond0/statistics/rx_nohandler
33151

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f344b0d9

team: track sum of rx_nohandler for all slaves · bb63daf9

Jarod Wilson authored Feb 01, 2016

CC: Jiri Pirko <jiri@resnulli.us>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bb63daf9

net: add rx_nohandler stat counter · 6e7333d3

Jarod Wilson authored Feb 01, 2016

This adds an rx_nohandler stat counter, along with a sysfs statistics
node, and copies the counter out via netlink as well.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@mellanox.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: Tom Herbert <tom@herbertland.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6e7333d3

net/core: relax BUILD_BUG_ON in netdev_stats_to_stats64 · 9256645a

Jarod Wilson authored Feb 01, 2016

The netdev_stats_to_stats64 function copies the deprecated
net_device_stats format stats into rtnl_link_stats64 for legacy support
purposes, but with the BUILD_BUG_ON as it was, it wasn't possible to
extend rtnl_link_stats64 without also extending net_device_stats. Relax
the BUILD_BUG_ON to only require that rtnl_link_stats64 is larger, and
zero out all the stat counters that aren't present in net_device_stats.

CC: Eric Dumazet <edumazet@google.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9256645a

tipc: fix link priority propagation · 81729810

Richard Alpe authored Feb 01, 2016

Currently link priority changes isn't handled for active links. In
this patch we resolve this by changing our priority if the peer passes
a valid priority in a state message.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

81729810

tipc: fix link attribute propagation bug · d01332f1

Richard Alpe authored Feb 01, 2016

Changing certain link attributes (link tolerance and link priority)
from the TIPC management tool is supposed to automatically take
effect at both endpoints of the affected link.

Currently the media address is not instantiated for the link and is
used uninstantiated when crafting protocol messages designated for the
peer endpoint. This means that changing a link property currently
results in the property being changed on the local machine but the
protocol message designated for the peer gets lost. Resulting in
property discrepancy between the endpoints.

In this patch we resolve this by using the media address from the
link entry and using the bearer transmit function to send it. Hence,
we can now eliminate the redundant function tipc_link_prot_xmit() and
the redundant field tipc_link::media_addr.

Fixes: 2af5ae37 (tipc: clean up unused code and structures)
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reported-by: Jason Hu <huzhijiang@gmail.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d01332f1

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 6247fd9f

David S. Miller authored Feb 06, 2016

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2016-02-03

This series contains updates to i40e and i40evf only.

Kiran adds the MAC filter element to the end of the list instead of HEAD
just in case there are ever any ordering issues in the future.

Anjali fixes several RSS issues, first fixes the hash PCTYPE enable for
X722 since it supports a broader selection of PCTYPES for TCP and UDP.
Then fixes a bug in XL710, X710, and X722 support for RSS since we cannot
reduce the 4-tuple for RSS for TCP/IPv4/IPv6 or UDP/IPv4/IPv6 packets
since this requires a product feature change coming in a later release.
Cleans up the reset code where the restart-autoneg workaround is
applied, since X722 does not need the workaround, add a flag to indicate
which MAC and firmware version require the workaround to be applied.
Adds new device id's for X722 and code to add their support.  Also
adds another way to access the RSS keys and lookup table using the admin
queue for X722 devices.

Catherine updates the driver to replace the MAC check with a feature
flag check for 100M SGMII, since it is only support on X722 devices
currently.

Mitch reworks the VF driver to allow channel bonding, which was not
possible before this patch due to the asynchronous nature of the admin
queue mechanism.  Also fixes a rare case which causes a panic if the
VF driver is removed during reset recovery, resolve this by setting the
ring pointers to NULL after freeing them.

Shannon cleans up the driver where device capabilities were defined in
two different places, and neither had all the definitions, so he
consolidates the definitions in the admin queue API.  Also adds the new
proxy-wake-on-lan capability bit available with the new X722 device.
Lastly, added the new External Device Power Ability field to the
get_link_status data structure by using a reserved field at the end
of the structure.

Jesse mimics the ixgbe driver's use of a private work queue in the i40e
and i40evf drivers to avoid blocking the system work queue.

Greg cleans up the driver to limit the firmware revision checks to
properly handle DCB configurations from the firmware to the older
devices which need these checks (specifically X710 and XL710 devices
only).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6247fd9f

05 Feb, 2016 1 commit

ipvlan: inherit MTU from master device · 296d4856

Mahesh Bandewar authored Jan 27, 2016

When we create IPvlan slave; we use ether_setup() and that
sets up default MTU to 1500 while the master device may have
lower / different MTU. Any subsequent changes to the masters'
MTU are reflected into the slaves' MTU setting. However if those
don't happen (most likely scenario), the slaves' MTU stays at
1500 which could be bad.

This change adds code to inherit MTU from the master device
instead of using the default value during the link initialization
phase.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tim Hockins <thockins@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

296d4856

04 Feb, 2016 6 commits

i40e: add 100Mb ethtool reporting · f8db54cc

Catherine Sullivan authored Dec 22, 2015

Add some missing reporting/advertisement of 100Mb capability
for adapters that support it.

Change-ID: I8b8523fbdc99517bec29d90c71b3744db11542ac
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

f8db54cc

i40e: AQ Add external power class to get link status · 5eb772f7

Shannon Nelson authored Dec 22, 2015

Add the new External Device Power Ability field to the get_link_status data
structure, using space from the reserved field at the end of the struct.
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Kevin Scott <kevin.c.scott@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

5eb772f7

i40e: AQ Geneve cloud tunnel type · 59264253

Shannon Nelson authored Dec 22, 2015

Fix the name of the new cloud tunnel type from the place-holder NGE
name to the official Geneve.  Also fix the spelling of the VXLAN type.
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Kevin Scott <kevin.c.scott@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>

59264253

i40e: AQ Add Run PHY Activity struct · 5394f02f

Shannon Nelson authored Dec 22, 2015

Add the AQ opcode and struct definitions for the Run PHY Activity command
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Kevin Scott <kevin.c.scott@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

5394f02f

i40e: Limit DCB FW version checks to X710/XL710 devices · 6dfae389

Greg Bowers authored Dec 22, 2015

X710/XL710 devices require FW version checks to properly handle DCB
configurations from the FW.  Newer devices do not, so limit these checks
to X710/XL710.
Signed-off-by: Greg Bowers <gregory.j.bowers@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

6dfae389

i40e: add new proxy-wol bit for X722 · 4ba40bce

Shannon Nelson authored Dec 22, 2015

Add the new proxy-wake-on-lan capability bit available with the
new X722 device.
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

4ba40bce