Commits · 10724cc7bb7832b482df049c20fd824d928c5eaa · nexedi / linux

03 May, 2016 29 commits

tipc: redesign connection-level flow control · 10724cc7

Jon Paul Maloy authored May 02, 2016

There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.

Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.

A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.

This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.

This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.

In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10724cc7

tipc: propagate peer node capabilities to socket layer · 60020e18

Jon Paul Maloy authored May 02, 2016

During neighbor discovery, nodes advertise their capabilities as a bit
map in a dedicated 16-bit field in the discovery message header. This
bit map has so far only be stored in the node structure on the peer
nodes, but we now see the need to keep a copy even in the socket
structure.

This commit adds this functionality.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

60020e18

tipc: re-enable compensation for socket receive buffer double counting · 7c8bcfb1

Jon Paul Maloy authored May 02, 2016

In the refactoring commit d570d864 ("tipc: enqueue arrived buffers
in socket in separate function") we did by accident replace the test

if (sk->sk_backlog.len == 0)
     atomic_set(&tsk->dupl_rcvcnt, 0);

with

if (sk->sk_backlog.len)
     atomic_set(&tsk->dupl_rcvcnt, 0);

This effectively disables the compensation we have for the double
receive buffer accounting that occurs temporarily when buffers are
moved from the backlog to the socket receive queue. Until now, this
has gone unnoticed because of the large receive buffer limits we are
applying, but becomes indispensable when we reduce this buffer limit
later in this series.

We now fix this by inverting the mentioned condition.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7c8bcfb1

rtl8152: correct speed testing · 2b84af94

Oliver Neukum authored May 02, 2016

Allow for SS+ USB
Signed-off-by: Oliver Neukum <ONeukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2b84af94

usbnet: correct speed testing · ea079842

Oliver Neukum authored May 02, 2016

Allow for SS+ USB
Signed-off-by: Oliver Neukum <ONeukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ea079842

brcm80211: correct speed testing · 8caf115c

Oliver Neukum authored May 02, 2016

Allow for SS+ USB
Signed-off-by: Oliver Neukum <ONeukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8caf115c

qed: Apply tunnel configurations after PF start · c0f31a05

Manish Chopra authored May 02, 2016

Configure and enable various tunnels on the
adapter after PF start.

This change was missed as a part of
'commit 464f6645
("qed: Add infrastructure support for tunneling")'
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c0f31a05

Merge branch 'stmmac-dwmac-socfpga-cleanup' · da7daf5b

David S. Miller authored May 03, 2016

Joachim Eastwood says:

====================
stmmac: dwmac-socfpga refactor+cleanup

This patch aims to remove the init/exit callbacks from the dwmac-
socfpga driver and instead use standard PM callbacks. Doing this
will also allow us to cleanup the driver.

Eventually the init/exit callbacks will be deprecated and removed
from all drivers dwmac-* except for dwmac-generic. Drivers will be
refactored to use standard PM and remove callbacks.

This patch set should not change the behavior of the driver itself,
it only moves code around. The only exception to this is patch
number 4 which restores the resume callback behavior which was
changed in the "net: stmmac: socfpga: Remove re-registration of
reset controller" patch. I belive calling phy_resume() only
from the resume callback and not probe is the right thing to do.

Changes from v1:
 - Rebase on net-next

One heads-up here:
The first patch changes the prototype of a couple of
functions used in Alexandre's "add Ethernet glue logic for
stm32 chip" patch [1] and will cause build failures for
dwmac-stm32.c if not fixed up!
If Alexandre's patch set is applied first I will gladly
rebase my patch set to account for his driver as well.

[1] https://patchwork.ozlabs.org/patch/614405/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

da7daf5b

stmmac: dwmac-socfpga: kill init() and rename setup() to set_phy_mode() · 0f400a87

Joachim Eastwood authored May 01, 2016

Remove old init callback which now contains only a call to
socfpga_dwmac_setup(). Also rename socfpga_dwmac_setup() to indicate
what the function really does.
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Tested-by: Marek Vasut <marex@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

0f400a87

stmmac: dwmac-socfpga: call phy_resume() only in resume callback · 53737247

Joachim Eastwood authored May 01, 2016

Calling phy_resume() should only be need during driver resume to
workaround a hardware errata.
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Tested-by: Marek Vasut <marex@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

53737247

stmmac: dwmac-socfpga: keep a copy of stmmac_rst in driver priv data · 70cb136f

Joachim Eastwood authored May 01, 2016

The dwmac-socfpga driver needs to control the reset usually managed
by the core driver to set the PHY mode. Take a copy of the reset
handle from core priv data so it can be used by the driver later.

This also allow us to move reset handling into socfpga_dwmac_setup()
where the code that needs it is located.
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Tested-by: Marek Vasut <marex@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

70cb136f

stmmac: dwmac-socfpga: add PM ops and resume function · 56868dee

Joachim Eastwood authored May 01, 2016

Implement the needed PM callbacks in the driver instead of
relying on the init/exit hooks in stmmac_platform. This gives
the driver more flexibility in how the code is organized.

Eventually the init/exit callbacks will be deprecated in favor
of the standard PM callbacks and driver remove function.
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Tested-by: Marek Vasut <marex@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

56868dee

stmmac: let remove/resume/suspend functions take device pointer · f4e7bd81

Joachim Eastwood authored May 01, 2016

Change stmmac_remove/resume/suspend to take a device pointer so
they can be used directly by drivers that doesn't need to perform
anything device specific.

This lets us remove the PCI pm functions and later simplifiy the
platform drivers.
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Tested-by: Marek Vasut <marex@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

f4e7bd81

net: ethernet: fec_mpc52xx: move to new ethtool api {get|set}_link_ksettings · e03179fe

Philippe Reynes authored May 01, 2016

The ethtool api {get|set}_settings is deprecated.
We move the fec_mpc52xx driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e03179fe

net: ethernet: fs-enet: move to new ethtool api {get|set}_link_ksettings · a10cdae0

Philippe Reynes authored May 01, 2016

The ethtool api {get|set}_settings is deprecated.
We move the fs-enet driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a10cdae0

net: ethernet: ucc: move to new ethtool api {get|set}_link_ksettings · 5e74bf2d

Philippe Reynes authored May 01, 2016

The ethtool api {get|set}_settings is deprecated.
We move the ucc driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e74bf2d

net: ethernet: gianfar: move to new ethtool api {get|set}_link_ksettings · 0d1bcdc7

Philippe Reynes authored May 01, 2016

The ethtool api {get|set}_settings is deprecated.
We move the gianfar driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0d1bcdc7

VSOCK: constify vsock_transport structure · 56130915

Julia Lawall authored May 01, 2016

The vsock_transport structure is never modified, so declare it as const.

Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

56130915

drivers: net: xgene: constify xgene_cle_ops structure · b555a3d1

Julia Lawall authored May 01, 2016

The xgene_cle_ops structure is never modified, so declare it as const.

Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b555a3d1

fq_codel: add batch ability to fq_codel_drop() · 9d18562a

Eric Dumazet authored May 01, 2016

In presence of inelastic flows and stress, we can call
fq_codel_drop() for every packet entering fq_codel qdisc.

fq_codel_drop() is quite expensive, as it does a linear scan
of 4 KB of memory to find a fat flow.
Once found, it drops the oldest packet of this flow.

Instead of dropping a single packet, try to drop 50% of the backlog
of this fat flow, with a configurable limit of 64 packets per round.

TCA_FQ_CODEL_DROP_BATCH_SIZE is the new attribute to make this
limit configurable.

With this strategy the 4 KB search is amortized to a single cache line
per drop [1], so fq_codel_drop() no longer appears at the top of kernel
profile in presence of few inelastic flows.

[1] Assuming a 64byte cache line, and 1024 buckets
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dave Taht <dave.taht@gmail.com>
Cc: Jonathan Morton <chromatix99@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Dave Taht
Signed-off-by: David S. Miller <davem@davemloft.net>

9d18562a

ravb: Remove rx buffer ALIGN · 094e43d5

Kazuya Mizuguchi authored May 02, 2016

Aligning the reception data size is not required.
Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
Signed-off-by: Yoshihiro Kaneko <ykaneko0929@gmail.com>
Tested-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

094e43d5

Merge tag 'wireless-drivers-next-for-davem-2016-05-02' of... · ede00a5c

David S. Miller authored May 03, 2016

Merge tag 'wireless-drivers-next-for-davem-2016-05-02' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers patches for 4.7

Major changes:

brcmfmac

* add support for nl80211 BSS_SELECT feature

mwifiex

* add platform specific wakeup interrupt support

ath10k

* implement set_tsf() for 10.2.4 branch
* remove rare MSI range support
* remove deprecated firmware API 1 support

ath9k

* add module parameter to invert LED polarity

wcn36xx

* fixes to get the driver properly working on Dragonboard 410c
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ede00a5c

net: relax expensive skb_unclone() in iptunnel_handle_offloads() · 9580bf2e

Eric Dumazet authored Apr 30, 2016

Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP
tunnel have to go through an expensive skb_unclone()

Reallocating skb->head is a lot of work.

Test should really check if a 'real clone' of the packet was done.

TCP does not care if the original gso_type is changed while the packet
travels in the stack.

This adds skb_header_unclone() which is a variant of skb_clone()
using skb_header_cloned() check instead of skb_cloned().

This variant can probably be used from other points.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9580bf2e

netdevice: shrink size of struct netdev_queue · c0ef079c

Florian Westphal authored May 03, 2016

- trans_timeout is incremented when tx queue timed out (tx watchdog).
- tx_maxrate is set via sysfs

Moving tx_maxrate to read-mostly part shrinks the struct by 64 bytes.
While at it, also move trans_timeout (it is out-of-place in the
'write-mostly' part).
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c0ef079c

Merge branch 'bridge-per-vlan-stats' · e8194d4f

David S. Miller authored May 02, 2016

Nikolay Aleksandrov says:

====================
bridge: per-vlan stats

This set adds support for bridge per-vlan statistics.
In order to be able to dump statistics for many vlans we need a way to
continue dumping after reaching maximum size, thus patches 01 and 02 extend
the new stats API with a per-device extended link stats attribute and
callback which can save its local state and continue where it left off
afterwards. I considered using the already existing "fill_xstats" callback
but it gets confusing since we need to separate the linkinfo dump from the
new stats api dump and adding a flag/argument to do that just looks messy.
I don't think the rtnl_link_ops size is an issue, so adding these seemed
like the cleaner approach.

Patches 03 and 04 add the stats support and netlink dump support
respectively. The stats accounting is controlled via a bridge option which
is default off, thus the performance impact is kept minimal.
I've tested this set with both old and modified iproute2, kmemleak on and
some traffic stress tests while adding/removing vlans and ports.

v3:
 - drop the RCU pvid patch and remove one pointer fetch as requested
 - make stats accounting optional with default to off, the option is in the
   same cache line as vlan_proto and vlan_enabled, so it is already fetched
   before the fast path check thus the performance impact is minimal, this
   also allows us to avoid one vlan lookup and return early when using pvid
 - rebased and retested

v2:
 - Improve the error checking, rename lidx to prividx and save the current
   idx user instead of restricting it to one in patch 01
 - squash patch 02 into 01 and remove the restriction
 - add callback descriptions, improve the size calculation and change the
   xstats message structure to have an embedding level per rtnl link type
   so we can avoid one call to get the link type (and thus filter on it)
   and also each link type can now have any number of private attributes
   inside
 - fix a problem where the vlan stats are not dumped if the bridge has 0
   vlans on it but has vlans on the ports, add bridge link type private
   attributes and also add paddings for future extensions to avoid at least
   a few netlink attributes and improve struct alignment
 - drop the is_skb_forwardable argument constifying patch as it's not
   needed anymore, but it's a nice cleanup which I'll send separately
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e8194d4f

bridge: netlink: export per-vlan stats · a60c0903

Nikolay Aleksandrov authored Apr 30, 2016

Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the
RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats and
get_linkxstats_size) in order to export the per-vlan stats.
The paddings were added because soon these fields will be needed for
per-port per-vlan stats (or something else if someone beats me to it) so
avoiding at least a few more netlink attributes.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a60c0903

bridge: vlan: learn to count · 6dada9b1

Nikolay Aleksandrov authored Apr 30, 2016

Add support for per-VLAN Tx/Rx statistics. Every global vlan context gets
allocated a per-cpu stats which is then set in each per-port vlan context
for quick access. The br_allowed_ingress() common function is used to
account for Rx packets and the br_handle_vlan() common function is used
to account for Tx packets. Stats accounting is performed only if the
bridge-wide vlan_stats_enabled option is set either via sysfs or netlink.
A struct hole between vlan_enabled and vlan_proto is used for the new
option so it is in the same cache line. Currently it is binary (on/off)
but it is intentionally restricted to exactly 0 and 1 since other values
will be used in the future for different purposes (e.g. per-port stats).
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6dada9b1

net: rtnetlink: add linkxstats callbacks and attribute · 97a47fac

Nikolay Aleksandrov authored Apr 30, 2016

Add callbacks to calculate the size and fill link extended statistics
which can be split into multiple messages and are dumped via the new
rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute.
Also add that attribute to the idx mask check since it is expected to
be able to save state and resume dumping (e.g. future bridge per-vlan
stats will be dumped via this attribute and callbacks).
Each link type should nest its private attributes under the per-link type
attribute. This allows to have any number of separated private attributes
and to avoid one call to get the dev link type.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

97a47fac

net: rtnetlink: allow rtnl_fill_statsinfo to save private state counter · e8872a25

Nikolay Aleksandrov authored Apr 30, 2016

The new prividx argument allows the current dumping device to save a
private state counter which would enable it to continue dumping from
where it left off. And the idxattr is used to save the current idx user
so multiple prividx using attributes can be requested at the same time
as suggested by Roopa Prabhu.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e8872a25

02 May, 2016 11 commits

Merge branch 'ipv6-tunnel-cleanups' · d1ac3b16

David S. Miller authored May 02, 2016

Tom Herbert says:

====================
net: Cleanup IPv6 ip tunnels

The IPv6 tunnel code is very different from IPv4 code. There is a lot
of redundancy with the IPv4 code, particularly in the GRE tunneling.

This patch set cleans up the tunnel code to make the IPv6 code look
more like the IPv4 code and use common functions between the two
stacks where possible.

This work should make it easier to maintain and extend the IPv6 ip
tunnels.

Items in this patch set:
  - Cleanup IPv6 tunnel receive path (ip6_tnl_rcv). Includes using
    gro_cells and exporting ip6_tnl_rcv so the ip6_gre can call it
  - Move GRE functions to common header file (tx functions) or
    gre_demux.c (rx functions like gre_parse_header)
  - Call common GRE functions from IPv6 GRE
  - Create ip6_tnl_xmit (to be like ip_tunnel_xmit)

Tested:
  Ran super_netperf tests for TCP_RR and TCP_STREAM for:
    - IPv4 over gre, gretap, gre6, gre6tap
    - IPv6 over gre, gretap, gre6, gre6tap
    - ipip
    - ip6ip6
    - ipip/gue
    - IPv6 over gre/gue
    - IPv4 over gre/gue
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d1ac3b16

gre6: Cleanup GREv6 transmit path, call common GRE functions · b05229f4

Tom Herbert authored Apr 29, 2016

Changes in GREv6 transmit path:
  - Call gre_checksum, remove gre6_checksum
  - Rename ip6gre_xmit2 to __gre6_xmit
  - Call gre_build_header utility function
  - Call ip6_tnl_xmit common function
  - Call ip6_tnl_change_mtu, eliminate ip6gre_tunnel_change_mtu
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b05229f4

ipv6: Generic tunnel cleanup · 79ecb90e

Tom Herbert authored Apr 29, 2016

A few generic changes to generalize tunnels in IPv6:
  - Export ip6_tnl_change_mtu so that it can be called by ip6_gre
  - Add tun_hlen to ip6_tnl structure.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

79ecb90e

gre: Create common functions for transmit · 182a352d

Tom Herbert authored Apr 29, 2016

Create common functions for both IPv4 and IPv6 GRE in transmit. These
are put into gre.h.

Common functions are for:
  - GRE checksum calculation. Move gre_checksum to gre.h.
  - Building a GRE header. Move GRE build_header and rename
    gre_build_header.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

182a352d

ipv6: Create ip6_tnl_xmit · 8eb30be0

Tom Herbert authored Apr 29, 2016

This patch renames ip6_tnl_xmit2 to ip6_tnl_xmit and exports it. Other
users like GRE will be able to call this. The original ip6_tnl_xmit
function is renamed to ip6_tnl_start_xmit (this is an ndo_start_xmit
function).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8eb30be0

gre6: Cleanup GREv6 receive path, call common GRE functions · 308edfdf

Tom Herbert authored Apr 29, 2016

- Create gre_rcv function. This calls gre_parse_header and ip6gre_rcv.
  - Call ip6_tnl_rcv. Doing this and using gre_parse_header eliminates
    most of the code in ip6gre_rcv.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

308edfdf

gre: Move utility functions to common headers · 95f5c64c

Tom Herbert authored Apr 29, 2016

Several of the GRE functions defined in net/ipv4/ip_gre.c are usable
for IPv6 GRE implementation (that is they are protocol agnostic).

These include:
  - GRE flag handling functions are move to gre.h
  - GRE build_header is moved to gre.h and renamed gre_build_header
  - parse_gre_header is moved to gre_demux.c and renamed gre_parse_header
  - iptunnel_pull_header is taken out of gre_parse_header. This is now
    done by caller. The header length is returned from gre_parse_header
    in an int* argument.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

95f5c64c

ipv6: Cleanup IPv6 tunnel receive path · 0d3c703a

Tom Herbert authored Apr 29, 2016

Some basic changes to make IPv6 tunnel receive path look more like
IPv4 path:
  - Make ip6_tnl_rcv non-static so that GREv6 and others can call it
  - Make ip6_tnl_rcv look like ip_tunnel_rcv
  - Switch to gro_cells_receive
  - Make ip6_tnl_rcv non-static and export it
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0d3c703a

Merge branch 'tcp-preempt' · 570d6320

David S. Miller authored May 02, 2016

Eric Dumazet says:

====================
net: make TCP preemptible

Most of TCP stack assumed it was running from BH handler.

This is great for most things, as TCP behavior is very sensitive
to scheduling artifacts.

However, the prequeue and backlog processing are problematic,
as they need to be flushed with BH being blocked.

To cope with modern needs, TCP sockets have big sk_rcvbuf values,
in the order of 16 MB, and soon 32 MB.
This means that backlog can hold thousands of packets, and things
like TCP coalescing or collapsing on this amount of packets can
lead to insane latency spikes, since BH are blocked for too long.

It is time to make UDP/TCP stacks preemptible.

Note that fast path still runs from BH handler.

v2: Added "tcp: make tcp_sendmsg() aware of socket backlog"
    to reduce latency problems of large sends.

v3: Fixed a typo in tcp_cdg.c
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

570d6320

tcp: make tcp_sendmsg() aware of socket backlog · d41a69f1

Eric Dumazet authored Apr 29, 2016

Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d41a69f1

net: do not block BH while processing socket backlog · 5413d1ba

Eric Dumazet authored Apr 29, 2016

Socket backlog processing is a major latency source.

With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
holding cpu for more than 5 ms, and packets being dropped by the NIC
once ring buffer is filled.

All users are now ready to be called from process context,
we can unblock BH and let interrupts be serviced faster.

cond_resched_softirq() could be removed, as it has no more user.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5413d1ba