Commits · 1aacde3d22c42281236155c1ef6d7a5aa32a826b · nexedi / linux

01 Jul, 2016 22 commits

bpf: generally move prog destruction to RCU deferral · 1aacde3d

Daniel Borkmann authored Jun 30, 2016

Jann Horn reported the following analysis of a UAF race that is very
hard (if not impossible) to trigger; to quote his event timeline:

 - Set up a process with threads T1, T2 and T3
 - Let T1 set up a socket filter F1 that invokes another filter F2
   through a BPF map [tail call]
 - Let T1 trigger the socket filter via a unix domain socket write,
   don't wait for completion
 - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
 - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
 - Let T3 close the file descriptor for F2, dropping the reference
   count of F2 to 2
 - At this point, T1 should have looked up F2 from the map, but not
   finished executing it
 - Let T3 remove F2 from the BPF map, dropping the reference count of
   F2 to 1
 - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
   the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
   via schedule_work()
 - At this point, the BPF program could be freed
 - BPF execution is still running in a freed BPF program

While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
event fd we're doing the syscall on doesn't disappear from underneath us
for whole syscall time, it may not be the case for the bpf fd used as
an argument only after we did the put. It needs to be a valid fd pointing
to a BPF program at the time of the call to make the bpf_prog_get() and
while T2 gets preempted, F2 must have dropped reference to 1 on the other
CPU. The fput() from the close() in T3 should also add additional delay
to the reference drop via exit_task_work() when bpf_prog_release() gets
called as well as scheduling bpf_prog_free_deferred().

That said, it makes nevertheless sense to move the BPF prog destruction
generally after RCU grace period to guarantee that such scenario above,
but also others as recently fixed in ceb56070 ("bpf, perf: delay release
of BPF prog after grace period") with regards to tail calls won't happen.
Integrating bpf_prog_free_deferred() directly into the RCU callback is
not allowed since the invocation might happen from either softirq or
process context, so we're not permitted to block. All bpf_prog_put()
invocations from the eBPF side (note, cBPF -> eBPF progs don't use this for
their destruction) look good to me with call_rcu().

Since we don't know whether at the time of attaching the program, we're
already part of a tail call map, we need to use RCU variant. However, due
to this, there won't be severely more stress on the RCU callback queue:
situations with above bpf_prog_get() and bpf_prog_put() combo in practice
normally won't lead to releases, but even if they would, enough effort/
cycles have to be put into loading a BPF program into the kernel already.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
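
A minimal userspace sketch of the two-stage teardown described above, not the kernel code itself. The last put only queues destruction (modelling call_rcu()), the RCU callback only hands off to a work item (modelling schedule_work(), since it must not block), and the work item performs the actual free (modelling bpf_prog_free_deferred()). All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a BPF program; every name here is hypothetical. */
struct toy_prog {
    int refcnt;
    bool rcu_pending;   /* queued, waiting for a grace period */
    bool work_pending;  /* grace period over, waiting for process context */
    bool freed;
};

/* Drop one reference; on the last put, only *queue* the destruction. */
static void toy_prog_put(struct toy_prog *p)
{
    assert(p->refcnt > 0);
    if (--p->refcnt == 0)
        p->rcu_pending = true;      /* models call_rcu() */
}

/* Models the RCU callback: it may run in softirq context, so it must
 * not block. It therefore only hands the program to a work item. */
static void toy_grace_period_elapsed(struct toy_prog *p)
{
    if (p->rcu_pending) {
        p->rcu_pending = false;
        p->work_pending = true;     /* models schedule_work() */
    }
}

/* Models the work item: process context, blocking teardown allowed. */
static void toy_run_work(struct toy_prog *p)
{
    if (p->work_pending) {
        p->work_pending = false;
        p->freed = true;
    }
}
```

In this model a reader that looked up the program before the last put stays safe until toy_grace_period_elapsed() runs, which is the guarantee the commit wants for tail-call users.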

atm: horizon: Use setup_timer · 466fc793

Amitoj Kaur Chawla authored Jun 30, 2016

Convert a call to init_timer and accompanying initializations of
the timer's data and function fields to a call to setup_timer.

The Coccinelle semantic patch that fixes this problem is
as follows:
@@
expression t,d,f,e1;
identifier x1;
statement S1;
@@

(
-t.data = d;
|
-t.function = f;
|
-init_timer(&t);
+setup_timer(&t,f,d);
|
-init_timer_on_stack(&t);
+setup_timer_on_stack(&t,f,d);
)
<... when != S1
t.x1 = e1;
...>
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
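
To illustrate what the semantic patch rewrites, here is a small userspace sketch with a mocked-up struct timer_list that models only the function and data fields. The mock types, on_timeout(), open_coded() and converted() are assumptions made for illustration and are not taken from the horizon driver.

```c
#include <assert.h>

/* Minimal userspace stand-in for the kernel's struct timer_list of
 * that era; only the two fields touched by the conversion exist. */
struct timer_list {
    void (*function)(unsigned long);
    unsigned long data;
};

static void init_timer(struct timer_list *t)
{
    t->function = 0;
    t->data = 0;
}

static void setup_timer(struct timer_list *t,
                        void (*fn)(unsigned long), unsigned long data)
{
    assert(fn != 0);
    init_timer(t);
    t->function = fn;
    t->data = data;
}

static unsigned long last_seen;

/* Hypothetical timer callback that records its argument. */
static void on_timeout(unsigned long data)
{
    last_seen = data;
}

/* Before: three open-coded statements. */
static void open_coded(struct timer_list *t, unsigned long dev)
{
    init_timer(t);
    t->function = on_timeout;
    t->data = dev;
}

/* After: one call with the same effect. */
static void converted(struct timer_list *t, unsigned long dev)
{
    setup_timer(t, on_timeout, dev);
}
```

Both helpers leave the timer in the same state; the converted form just keeps the callback and its argument next to the initialization.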

Merge branch 'qed-next' · e3cc6e37

David S. Miller authored Jul 01, 2016

Manish Chopra says:

====================
qede: Enhancements

This patch series adds support for a few small fastpath
features and does some code refactoring.

Note - regarding get/set tunable configuration via ethtool
Surprisingly, there is NO ethtool application support for
such configuration given that we have kernel support.
Do let us know if we need to add support for that in user ethtool.

Please consider applying this series to "net-next".
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Bump up driver version to 8.10.1.20 · 831a8e6c

Manish Chopra authored Jun 30, 2016

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Add get/set rx copy break tunable support · 3d789994

Manish Chopra authored Jun 30, 2016

Signed-off-by: Manish <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Utilize xmit_more · 312e0676

Manish Chopra authored Jun 30, 2016

This patch uses the xmit_more optimization to reduce
the number of TX doorbell writes per packet.
Signed-off-by: Manish <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
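
A rough userspace model of the xmit_more idea, not qede code: the doorbell write is skipped while the stack signals that another packet follows immediately, and it is issued once for the whole burst. struct toy_txq and toy_xmit() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical TX queue model: counts posted descriptors and doorbells. */
struct toy_txq {
    unsigned int prod;        /* descriptors posted so far */
    unsigned int doorbells;   /* doorbell (MMIO) writes issued */
    unsigned int hw_prod;     /* producer index last shown to the "hardware" */
};

/* Post one packet. 'more' plays the role of the xmit_more hint: the
 * stack promises that another packet follows, so the doorbell can wait. */
static void toy_xmit(struct toy_txq *q, bool more)
{
    assert(q != 0);
    q->prod++;
    if (!more) {
        q->hw_prod = q->prod;   /* publish everything posted so far */
        q->doorbells++;
    }
}
```

Sending a burst of four packets with the hint set on the first three results in a single doorbell write instead of four. A real driver also has to ring the doorbell when it stops the queue, which this sketch leaves out.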

qede: qede_poll refactoring · c774169d

Manish Chopra authored Jun 30, 2016

This patch cleans up the qede_poll() routine a bit
and makes qede_poll() do a single iteration of TX completion
handling [under heavy TX load, qede_poll() might otherwise
run indefinitely in its while(1) loop for TX completion
processing and leave the CPU stuck].
Signed-off-by: Manish <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Add support for handling IP fragmented packets. · c72a6125

Manish Chopra authored Jun 30, 2016

When handling IP fragmented packets with csum in their
transport header, the csum isn't changed as part of the
fragmentation. As a result, the packet containing the
transport headers would have the correct csum of the original
packet, but one that mismatches the actual packet that
passes on the wire. As a result, on receive path HW would
give an indication that the packet has incorrect csum,
which would cause qede to discard the incoming packet.

Since HW also delivers a notification of IP fragments,
change driver behavior to pass such incoming packets
to stack and let it make the decision whether it needs
to be dropped.
Signed-off-by: Manish <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
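
The receive-path decision described above can be sketched as a small classifier. The flag and enum names are made up for this example and do not correspond to the real qede or firmware definitions; the only assumption is that the hardware reports "checksum error" and "IP fragment" as separate indications.

```c
#include <assert.h>

/* Hypothetical per-packet indications reported by the hardware. */
#define TOY_RX_F_L4_CSUM_ERR  0x1u
#define TOY_RX_F_IP_FRAG      0x2u

enum toy_rx_action {
    TOY_RX_DROP,            /* discard in the driver */
    TOY_RX_PASS_CSUM_OK,    /* hand to the stack, checksum verified by HW */
    TOY_RX_PASS_CSUM_NONE   /* hand to the stack, let it verify the checksum */
};

static enum toy_rx_action toy_rx_classify(unsigned int flags)
{
    assert((flags & ~(TOY_RX_F_L4_CSUM_ERR | TOY_RX_F_IP_FRAG)) == 0);

    /* A fragment carries the checksum of the original, unfragmented
     * packet, so a HW checksum error on it is not meaningful. */
    if (flags & TOY_RX_F_IP_FRAG)
        return TOY_RX_PASS_CSUM_NONE;
    if (flags & TOY_RX_F_L4_CSUM_ERR)
        return TOY_RX_DROP;
    return TOY_RX_PASS_CSUM_OK;
}
```

The key point is the ordering: the fragment check comes before the checksum-error check, so fragments are never dropped on the basis of a checksum the hardware could not have validated.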

Merge branch 'tun-skb_array' · beb528d0

David S. Miller authored Jul 01, 2016

Jason Wang says:

====================
switch to use tx skb array in tun

This series tries to switch to use skb array in tun. This is used to
eliminate the spinlock contention between producer and consumer. The
conversion was straightforward: just introduce a tx skb array and use
it instead of sk_receive_queue.

A minor issue is to keep the tx_queue_len behaviour, since tun used to
use it for the length of sk_receive_queue. This is done through:

- add the ability to resize multiple rings at once to avoid handling
  partial resize failure for multiple rings.
- add the support for zero length ring.
- introduce a notifier which was triggered when tx_queue_len was
  changed for a netdev.
- resize all queues during the tx_queue_len changing.

Tests show about 15% improvement on guest rx pps:

Before: ~1300000pps
After : ~1500000pps

Changes from V3:
- fix kbuild warnings
- call NETDEV_CHANGE_TX_QUEUE_LEN on IFLA_TXQLEN

Changes from V2:
- add multiple rings resizing support for ptr_ring/skb_array
- add zero length ring support
- introduce NETDEV_CHANGE_TX_QUEUE_LEN
- drop new flags

Changes from V1:
- switch to use skb array instead of a customized circular buffer
- add non-blocking support
- rename .peek to .peek_len
- drop lockless peeking since tests show only a very minor improvement
====================
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-from-altitude: 34697 feet.
Signed-off-by: David S. Miller <davem@davemloft.net>

tun: switch to use skb array for tx · 1576d986

Jason Wang authored Jun 30, 2016

We used to queue tx packets in sk_receive_queue. This is less
efficient since it requires spinlocks to synchronize between producer
and consumer.

This patch tries to address this by:

- switch from sk_receive_queue to a skb_array, and resize it when
  tx_queue_len is changed.
- introduce a new proto_ops peek_len which is used for peeking the
  skb length.
- implement a tun version of peek_len for vhost_net to use and convert
  vhost_net to use peek_len if possible.

Pktgen test shows about 15.3% improvement on guest receiving pps for small
buffers:

Before: ~1300000pps
After : ~1500000pps
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: introduce NETDEV_CHANGE_TX_QUEUE_LEN · 08294a26

Jason Wang authored Jun 30, 2016

This patch introduces a new event, NETDEV_CHANGE_TX_QUEUE_LEN, which
is triggered when tx_queue_len is changed. It can be used by net devices
that want to do some processing at that time. An example is tun, which may
want to resize its tx array when tx_queue_len is changed.

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
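
A toy userspace model of the event flow, assuming a single callback per device instead of the kernel's notifier chain. The types and functions below are hypothetical; only the event name comes from the commit.

```c
#include <assert.h>

enum toy_event { TOY_NETDEV_CHANGE_TX_QUEUE_LEN = 1 };

struct toy_netdev {
    unsigned int tx_queue_len;
    unsigned int ring_size;   /* what a tun-like driver keeps in sync */
    void (*notifier)(struct toy_netdev *dev, enum toy_event ev);
};

/* A tun-like listener: resize the tx array to follow tx_queue_len. */
static void toy_tun_notifier(struct toy_netdev *dev, enum toy_event ev)
{
    if (ev == TOY_NETDEV_CHANGE_TX_QUEUE_LEN)
        dev->ring_size = dev->tx_queue_len;
}

/* Change the queue length and tell the listener about it. */
static void toy_change_tx_queue_len(struct toy_netdev *dev, unsigned int len)
{
    assert(dev != 0);
    if (len == dev->tx_queue_len)
        return;                     /* nothing changed, nothing to report */
    dev->tx_queue_len = len;
    if (dev->notifier)
        dev->notifier(dev, TOY_NETDEV_CHANGE_TX_QUEUE_LEN);
}
```

With toy_tun_notifier registered, a call such as toy_change_tx_queue_len(&dev, 1000) leaves ring_size at 1000, while a device without a listener is unaffected.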

skb_array: add wrappers for resizing · bf900b3d

Jason Wang authored Jun 30, 2016

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptr_ring: support resizing multiple queues · 59e6ae53

Michael S. Tsirkin authored Jun 30, 2016

Sometimes we need to resize multiple queues at once, because it
is not easy to recover from a partial failure when resizing
multiple queues.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
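
One way to get the all-or-nothing behaviour is to allocate every replacement array first and only then swap them in. The sketch below shows that two-phase pattern in userspace under simplifying assumptions: there is no locking, existing entries are not copied over, and toy_fail_at is a hypothetical hook for injecting an allocation failure. It is not the ptr_ring implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct toy_ring {
    void **queue;
    int size;
};

/* Index of the allocation that should fail; -1 means never fail. */
static int toy_fail_at = -1;

static void **toy_alloc_queue(int idx, int size)
{
    if (idx == toy_fail_at)
        return NULL;
    return calloc(size ? size : 1, sizeof(void *));
}

static int toy_resize_multiple(struct toy_ring **rings, int nrings, int size)
{
    void ***queues;
    int i;

    assert(nrings > 0 && size >= 0);
    queues = calloc(nrings, sizeof(*queues));
    if (!queues)
        return -ENOMEM;

    /* Phase 1: allocate every replacement before touching any ring. */
    for (i = 0; i < nrings; i++) {
        queues[i] = toy_alloc_queue(i, size);
        if (!queues[i])
            goto fail;
    }

    /* Phase 2: nothing can fail any more, so swap them all in. */
    for (i = 0; i < nrings; i++) {
        free(rings[i]->queue);
        rings[i]->queue = queues[i];
        rings[i]->size = size;
    }
    free(queues);
    return 0;

fail:
    /* Undo phase 1 only; the rings themselves were never modified. */
    while (--i >= 0)
        free(queues[i]);
    free(queues);
    return -ENOMEM;
}
```

If any allocation fails, every ring keeps its old array and size, so the caller never has to undo a half-finished resize.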

skb_array: minor tweak · fd68adec

Jason Wang authored Jun 30, 2016

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptr_ring: support zero length ring · 982fb490

Jason Wang authored Jun 30, 2016

Sometimes we need a zero-length ring, but the current code will crash since
we don't do any check before accessing the ring. This patch fixes this.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
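
A simplified single-threaded ring showing where such checks belong, assuming NULL marks an empty slot. The struct and function names are hypothetical and the real ptr_ring code differs; the point is only that both the producer and the consumer test the size before indexing the array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_ring0 {
    void **queue;   /* may be NULL when size is 0 */
    int size;       /* may legitimately be 0 */
    int producer;
    int consumer;
};

static int toy_produce(struct toy_ring0 *r, void *ptr)
{
    assert(ptr != NULL);    /* NULL is reserved for "slot is empty" */
    /* Check the size first: with size 0 there is no slot to look at. */
    if (r->size == 0 || r->queue[r->producer])
        return -ENOSPC;
    r->queue[r->producer++] = ptr;
    if (r->producer >= r->size)
        r->producer = 0;
    return 0;
}

static void *toy_consume(struct toy_ring0 *r)
{
    void *ptr;

    if (r->size == 0)
        return NULL;
    ptr = r->queue[r->consumer];
    if (ptr) {
        r->queue[r->consumer++] = NULL;
        if (r->consumer >= r->size)
            r->consumer = 0;
    }
    return ptr;
}
```

A zero-length ring then behaves like a ring that is always full for producers and always empty for consumers.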

Merge branch 'sch_hfsc-fixes-cleanups' · 8dc7243a

David S. Miller authored Jul 01, 2016

Michal Soltys says:

====================
HFSC patches, part 1

It's a revised version of part of the patches I submitted a really, really long
time ago (back then I asked Patrick to ignore them as I found some issues
shortly after submitting).

Anyway this is the first set with very simple fixes/changes though some of them
relatively subtle (I tried to do very exhaustive commit messages explaining what
and why with those).

The patches are against net-next tree.

The second set will be heavier - or rather with more complex explanations, among those I have:

- a fix to subtle issue introduced in
  http://permalink.gmane.org/gmane.linux.kernel.commits.2-4/8281
  along with simplifying related stuff
- update times to 96 bits (which allows to "just" use 32 bit shifts and
  improves curve definition accuracy at more extreme low/high speeds)
- add curve "merging" instead of just selecting in convex case (computations
  mirror those from concave intersection)

But these are eventually for later.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc() · 33ef84a7

Michal Soltys authored Jun 30, 2016

cl->cl_vt alone is relative only to the current backlog period, while
the curve operates on cumulative virtual time. This patch adds missing
cl->cl_vtoff.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched/sch_hfsc.c: go passive after vt update · ab12cb47

Michal Soltys authored Jun 30, 2016

When a class is going passive, it should update its cl_vt first
to be consistent with the last dequeue operation.

Otherwise its cl_vt will be one packet behind and parent's cvtmax might
not be updated as well.

One possible side effect is if some class goes passive and subsequently
goes active /without/ its parent going passive - with cl_vt lagging one
packet behind - comparison made in init_vf() will be affected (same
period).
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched/sch_hfsc.c: remove leftover dlist and droplist · 2354f056

Michal Soltys authored Jun 30, 2016

This is an update to:
commit a09ceb0e ("sched: remove qdisc->drop")

That commit removed qdisc->drop, but left alone dlist and droplist
that no longer serve any meaningful purpose.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len() · d1d0fc5e

Michal Soltys authored Jun 30, 2016

The condition can only succeed on wrong configurations.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline · 12d0ad3b

Michal Soltys authored Jun 30, 2016

Realtime scheduling implemented in HFSC uses head of the queue to make
the decision about which packet to schedule next. But in case of any
head drop, the deadline calculated for the previous head is not
necessarily correct for the next head (unless both packets have the same
length).

Thanks to peek() function used during dequeue - which internally is a
dequeue operation - hfsc is almost safe from this issue, as peek()
dequeues and isolates the head storing it temporarily until the real
dequeue happens.

But there is one exception: if after the class activation a drop happens
before the first dequeue operation, there's never a chance to do the
peek().

Adding peek() call in enqueue - if this is the first packet in a new
backlog period AND the scheduler has realtime curve defined - fixes that
one corner case. The 1st hfsc_dequeue() will use that peeked packet,
similarly as every subsequent hfsc_dequeue() call uses packet peeked by
the previous call.
Signed-off-by: Michal Soltys <soltys@ziu.info>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: md5: use kmalloc() backed scratch areas · 19689e38

Eric Dumazet authored Jun 27, 2016

Some arches have virtually mapped kernel stacks, or will soon have.

tcp_md5_hash_header() uses an automatic variable to copy tcp header
before mangling th->check and calling crypto function, which might
be problematic on such arches.

David says that using percpu storage is also problematic on non SMP
builds.

Just use kmalloc() to allocate scratch areas.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

30 Jun, 2016 18 commits

Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 435c556c

David S. Miller authored Jun 30, 2016

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2016-06-29

This series contains updates and fixes to e1000e, igb, ixgbe and fm10k.  A
true smorgasbord of changes.

Jake cleans up some obscurity by not using the BIT() macro on bitshift
operation and also fixed the calculated index when looping through the
indir array.  Fixes the issue with igb's workqueue item for overflow
check from causing a surprise remove event.  The ptp_flags variable is
added to simplify the work of writing several complex MAC type checks
in the PTP code while fixing the workqueue.

Alex Duyck fixes the receive buffers alignment which should not be L1
cache aligned, but to 512 bytes instead.

Denys Vlasenko prevents a division by zero which was reported under
VMWare for e1000e.

Amritha fixes an issue where filters in a child hash table must be
cleared from the hardware before deleting the filter links in ixgbe.

Bhaktipriya Shridhar simply replaces the deprecated create_workqueue()
with alloc_workqueue() for fm10k.

Tony corrects ixgbe ethtool reporting to show x550 supports hardware
timestamping of all packets.

Emil fixes an issue where MAC-VLANs on the VF fail to pass traffic due
to spoofed packets.

Andrew Lunn increases performance on some systems where syncing a buffer
for DMA is expensive.  So rather than sync the whole 2K receive buffer,
only synchronize the length of the frame.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'nfp-next' · c435e6e0

David S. Miller authored Jun 30, 2016

Jakub Kicinski says:

====================
nfp: few code improvements

Three small patches for net-next.  First and second patches
improve the code quality by spelling things correctly and
removing unused parameters.  Third patch hooks-in standard
kernel implementation of .get_link() in ethtool ops.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: implement ethtool .get_link() callback · 2370def2

Jakub Kicinski authored Jun 29, 2016

Point the ethtool .get_link() callback to the standard
ethtool_op_get_link() implementation.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: remove unused parameter from nfp_net_write_mac_addr() · f642963b

Jakub Kicinski authored Jun 29, 2016

nfp_net_write_mac_addr() always writes to the BAR the current
device address taken from netdev struct.  The address given
as parameter is actually ignored.  Since all callers pass
netdev->dev_addr simply remove the parameter.

While at it improve the function's kdoc a bit.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: correct name of control BAR define · 796312cd

Jakub Kicinski authored Jun 29, 2016

Spell abbreviation of control as ctrl not crtl.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: signedness bug in be_msix_enable() · 6fde0e63

Dan Carpenter authored Jun 29, 2016

"num_vec" needs to be signed for the error handling to work.

Fixes: e261768e ('be2net: support asymmetric rx/tx queue counts')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Sathya Perla <sathya.perla@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
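
The class of bug can be reproduced in a few lines of userspace C. toy_enable_vectors() is a hypothetical stand-in for any call that returns either a positive count or a negative errno; it is not the be2net code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical API: returns the number of vectors granted, or -errno. */
static int toy_enable_vectors(int want, int available)
{
    assert(want > 0);
    return available >= want ? want : -ENOSPC;
}

/* Buggy pattern: the result lands in an unsigned variable, so -ENOSPC
 * wraps to a huge positive value and a later "num_vec < 0" test can
 * never be true. */
static unsigned int toy_num_vec_unsigned(int want, int available)
{
    unsigned int num_vec = toy_enable_vectors(want, available);

    return num_vec;
}

/* Fixed pattern: a signed variable keeps the error visible. */
static int toy_num_vec_signed(int want, int available)
{
    int num_vec = toy_enable_vectors(want, available);

    return num_vec;
}
```

When four vectors are requested and only two are available, the signed variant returns a negative value that error handling can catch, while the unsigned variant reports what looks like a very large vector count.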

net: netcp: Fix a typo in keystone-netcp.txt · 9b9a553c

Masanari Iida authored Jun 29, 2016

This patch fixes a spelling typo in keystone-netcp.txt
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mediatek-next' · 833ba3d5

David S. Miller authored Jun 30, 2016

John Crispin says:

====================
net-next: mediatek: IRQ cleanups, fixes and grouping

This series contains 2 small code cleanups that are leftovers from the
MIPS support. There is also a small fix that adds proper locking to the
code accessing the IRQ registers. Without this fix we saw deadlocks caused
by the last patch of the series, which adds IRQ grouping. The grouping
feature allows us to use different IRQs for TX and RX. By doing so we can
use affinity to let the SoC handle the IRQs on different cores.

This series depends on a previous series currently sitting in net.git
starting with
	commit 562c5a70 ("net: mediatek: only wake the queue if it is stopped")
up to
	commit 82c6544d ("net: mediatek: remove superfluous queue wake up call")
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net-next: mediatek: add support for IRQ grouping · 80673029

John Crispin authored Jun 29, 2016

The ethernet core has 3 IRQs. Using the IRQ grouping registers we are able
to separate TX and RX IRQs, which allows us to service them on separate
cores. This patch splits the IRQ handler into 2 separate functions, one for
TX and another for RX. The TX housekeeping is split out into its own NAPI
handler.
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-next: mediatek: add IRQ locking · 7bc9ccec

John Crispin authored Jun 29, 2016

The code that enables and disables IRQs is missing proper locking. After
adding the IRQ grouping patch and routing the RX and TX IRQs to different
cores we experienced IRQ stalls. Fix this by adding proper locking.
We use a dedicated lock to reduce the latency of the IRQ code.
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-next: mediatek: don't use intermediate variables to store IRQ masks · eece71e8

John Crispin authored Jun 29, 2016

The code currently uses variables to store and never modify the bit masks
of interrupts. This is legacy code from an early version of the driver
that supported MIPS based SoCs where the IRQ bits depended on the actual
SoC. As the bits are the same for all ARM based SoCs using this driver we
can remove the intermediate variables.
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-next: mediatek: remove superfluous register reads · 6e6edd8b

John Crispin authored Jun 29, 2016

The driver was originally written for MIPS based SoC. These required the
IRQ mask register to be read after writing it to ensure that the content
was actually applied. As this version only works on ARM based SoCs, we can
safely remove the 2 reads as they are no longer required.
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib_rules: Added NLM_F_EXCL support to fib_nl_newrule · 153380ec

Mateusz Bajorski authored Jun 29, 2016

When adding a rule with the NLM_F_EXCL flag, check whether the same rule
already exists. If it does, exit with -EEXIST.

This is already implemented in iproute2:
        if (cmd == RTM_NEWRULE) {
                req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL;
                req.r.rtm_type = RTN_UNICAST;
        }

Tested ipv4 and ipv6 with net-next linux on qemu x86

expected behavior after patch:
localhost ~ # ip rule
0:    from all lookup local
32766:    from all lookup main
32767:    from all lookup default
localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
RTNETLINK answers: File exists
localhost ~ # ip rule
0:    from all lookup local
1005:    from 10.46.177.97 lookup 104
32766:    from all lookup main
32767:    from all lookup default

There was already topic regarding this but I don't see any changes
merged and problem still occurs.
https://lkml.kernel.org/r/1135778809.5944.7.camel+%28%29+localhost+%21+localdomain
Signed-off-by: Mateusz Bajorski <mateusz.bajorski@nokia.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
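
The behaviour shown in the shell session above can be modelled with a small in-memory rule table. This is a userspace sketch with hypothetical types (toy_rule, toy_ruleset) and a fixed capacity, not the fib_nl_newrule() code. TOY_NLM_F_EXCL mirrors the value of NLM_F_EXCL.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define TOY_NLM_F_EXCL 0x200   /* mirrors NLM_F_EXCL */
#define TOY_MAX_RULES  16

struct toy_rule {
    unsigned int pref;
    unsigned int table;
    char src[16];
};

struct toy_ruleset {
    struct toy_rule rules[TOY_MAX_RULES];
    int count;
};

static int toy_rule_equal(const struct toy_rule *a, const struct toy_rule *b)
{
    return a->pref == b->pref && a->table == b->table &&
           strcmp(a->src, b->src) == 0;
}

/* Add a rule; with the exclusive flag set, refuse exact duplicates. */
static int toy_newrule(struct toy_ruleset *rs, const struct toy_rule *r,
                       unsigned int flags)
{
    int i;

    assert(rs->count >= 0 && rs->count <= TOY_MAX_RULES);
    if (flags & TOY_NLM_F_EXCL) {
        for (i = 0; i < rs->count; i++)
            if (toy_rule_equal(&rs->rules[i], r))
                return -EEXIST;
    }
    if (rs->count == TOY_MAX_RULES)
        return -ENOMEM;
    rs->rules[rs->count++] = *r;
    return 0;
}
```

Adding the same rule twice with the flag set succeeds once and then fails with -EEXIST, which matches the "File exists" answer in the session above. Without the flag the duplicate is accepted.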

tcp: increase size at which tcp_bound_to_half_wnd bounds to > TCP_MSS_DEFAULT · 2631b79f

Seymour, Shane M authored Jun 28, 2016

In previous commit 01f83d69
the following comments were added:

"When peer uses tiny windows, there is no use in packetizing to sub-MSS
pieces for the sake of SWS or making sure there are enough packets in
the pipe for fast recovery."

The test should be > TCP_MSS_DEFAULT not >= 512. This allows low end
devices that send an MSS of 536 (TCP_MSS_DEFAULT) to see better network
performance by sending it 536 bytes of data at a time instead of bounding
to half window size (268). Other network stacks work this way, e.g. HP-UX.
Signed-off-by: Shane Seymour <shane.seymour@hpe.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
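
A small userspace sketch of the cutoff logic with both versions of the test side by side. It takes plain integers instead of a struct tcp_sock and leaves out other details of the kernel function; toy_bound_to_half_wnd() and its use_new_test parameter are hypothetical.

```c
#include <assert.h>

#define TOY_TCP_MSS_DEFAULT 536

/* use_new_test selects "> TCP_MSS_DEFAULT" (after the patch) instead of
 * ">= 512" (before the patch) as the condition for halving the window. */
static int toy_bound_to_half_wnd(int max_window, int pktsize, int use_new_test)
{
    int halve, cutoff;

    assert(max_window >= 0 && pktsize >= 0);
    halve = use_new_test ? max_window > TOY_TCP_MSS_DEFAULT
                         : max_window >= 512;
    cutoff = halve ? max_window >> 1 : max_window;

    if (cutoff && pktsize > cutoff)
        return cutoff;
    return pktsize;
}
```

For a peer with a 536-byte window, the old test bounds a 536-byte segment to 268 bytes, while the new test lets the full 536 bytes through. Larger windows are still halved as before.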

tcp: add an ability to dump and restore window parameters · b1ed4c4f

Andrey Vagin authored Jun 27, 2016

We found that sometimes a restored tcp socket doesn't work.

One reason for this bug is incorrect window parameters; in this case
tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
other side drops packets with this seq, because seq is less than
tp->rcv_nxt ( tcp_sequence() ).

Data from a send queue is sent only if there is enough space in a
window, so when we restore unacked data, we need to expand a window to
fit this data.

This was in a first version of this patch:
"tcp: extend window to fit all restored unacked data in a send queue"

Then Alexey recommended that I restore the window parameters instead of
adjusting them according to the data in the send queue. This sounds reasonable.

rcv_wnd has to be restored, because it was reported to another side
and the offered window is never shrunk.
One of the reasons why we need to restore snd_wnd was described above.

Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bridge-igmp-stats' · 641f7e40

David S. Miller authored Jun 30, 2016

Nikolay Aleksandrov says:

====================
net: bridge: add support for IGMP/MLD stats

This patchset adds support for the new IFLA_STATS_LINK_XSTATS_SLAVE
attribute which can be used with RTM_GETSTATS in order to export per-slave
statistics. It works by passing the attribute to the linkxstats callback
and if the callback user supports it - it should dump that slave's stats.
This is much more scalable and permits us to request only a single port's
statistics instead of dumping everything every time.
The second patch adds support for per-port IGMP/MLD statistics and uses
the new API to export them for the bridge and its ports. The stats are
made in a very lightweight manner, the normal fast-path is not affected
at all and the flood paths (br_flood/br_multicast_flood) are only affected
if the packet is IGMP and the IGMP stats have been enabled using cache-hot
data for the check.

v2: Patch 01 is new, patch 02 has been reworked to use the new API, also
in addition counters for IGMP/MLD parse errors have been added and members
are added for per-port multicast traffic stats. The multicast counting has
been slightly optimized (moved the br_multicast_count inside the IPv4/6
IGMP functions after the checks for IGMP traffic) to avoid one conditional
that was on all of the multicast traffic path (both IGMP and other).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: add support for IGMP/MLD stats and export them via netlink · 1080ab95

Nikolay Aleksandrov authored Jun 28, 2016

This patch adds stats support for the currently used IGMP/MLD types by the
bridge. The stats are per-port (plus one stat per-bridge) and per-direction
(RX/TX). The stats are exported via netlink via the new linkxstats API
(RTM_GETSTATS). In order to minimize the performance impact, a new option
is used to enable/disable the stats - multicast_stats_enabled, similar to
the recent vlan stats. Also in order to avoid multiple IGMP/MLD type
lookups and checks, we make use of the current "igmp" member of the bridge
private skb->cb region to record the type on Rx (both host-generated and
external packets pass by multicast_rcv()). We can do that since the igmp
member was used as a boolean and all the valid IGMP/MLD types are positive
values. The normal bridge fast-path is not affected at all, the only
affected paths are the flooding ones and since we make use of the IGMP/MLD
type, we can quickly determine if the packet should be counted using
cache-hot data (cb's igmp member). We add counters for:
* IGMP Queries
* IGMP Leaves
* IGMP v1/v2/v3 reports

* MLD Queries
* MLD Leaves
* MLD v1/v2 reports

These are invaluable when monitoring or debugging complex multicast setups
with bridges.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: rtnetlink: add support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute · 80e73cc5

Nikolay Aleksandrov authored Jun 28, 2016

This patch adds support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute
which allows exporting per-slave statistics if the master device supports
the linkxstats callback. The attribute is passed down to the linkxstats
callback and it is up to the callback user to use it (an example has been
added to the only current user - the bridge). This allows us to query only
specific slaves of master devices like bridge ports and export only what
we're interested in instead of having to dump all ports and searching only
for a single one. This will be used to export per-port IGMP/MLD stats and
also per-port vlan stats in the future, possibly other statistics as well.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
