Commits · 94caee8c312d96522bcdae88791aaa9ebcd5f22c · Kirill Smelkov / linux

20 Mar, 2015 40 commits

ebpf: add sched_act_type and map it to sk_filter's verifier ops · 94caee8c

Daniel Borkmann authored Mar 20, 2015

In order to prepare eBPF support for tc action, we need to add
sched_act_type, so that the eBPF verifier is aware of what helper
function act_bpf may use, that it can load skb data and read out
currently available skb fields.

This is bascially analogous to 96be4325 ("ebpf: add sched_cls_type
and map it to sk_filter's verifier ops").

BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT need to be
separate since both will have a different set of functionality in
future (classifier vs action), thus we won't run into ABI troubles
when the point in time comes to diverge functionality from the
classifier.

The future plan for act_bpf would be that it will be able to write
into skb->data and alter selected fields mirrored in struct __sk_buff.

For an initial support, it's sufficient to map it to sk_filter_ops.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

94caee8c

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 0fa74a4b

David S. Miller authored Mar 20, 2015

Conflicts:
	drivers/net/ethernet/emulex/benet/be_main.c
	net/core/sysctl_net_core.c
	net/ipv4/inet_diag.c

The be_main.c conflict resolution was really tricky.  The conflict
hunks generated by GIT were very unhelpful, to say the least.  It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().

So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged.  And this worked beautifully.

The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.
Signed-off-by: David S. Miller <davem@davemloft.net>

0fa74a4b

rhashtable: Fix undeclared EEXIST build error on ia64 · 6626af69

Herbert Xu authored Mar 20, 2015

We need to include linux/errno.h in rhashtable.h since it doesn't
always get included otherwise.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

6626af69

net: validate the range we feed to iov_iter_init() in sys_sendto/sys_recvfrom · 4de930ef

Al Viro authored Mar 20, 2015

Cc: stable@vger.kernel.org # v3.19
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

4de930ef

Merge branch 'amd-xgbe-next' · b4c11cb4

David S. Miller authored Mar 20, 2015

Tom Lendacky says:

====================
amd-xgbe: AMD XGBE driver updates 2015-03-19

The following series of patches includes functional updates and changes
to the driver.

- Use the phydev->advertising field instead of the phydev->supported
  field when configuring for auto-negotiation, etc.
- Use the phy_driver flags field for setting the transceiver type
  instead of hardcoding it in the ethtool support.
- Provide an auto-negotiation timeout check
- Clarify the Tx/Rx queue information messages
- Use the new DMA memory barrier operations
- Set the device DMA mask based on what the hardware reports
- Remove the software implementation of Tx coalescing
- Fix the reporting of the Rx coalescing value
- Use napi_alloc_skb when allocating an SKB in softirq

This patch series is based on net-next.

Changes from v2:
- Use jiffies instead of timespec for the auto-negotiation timeout check
- Remove the Rx path SKB allocation re-work patch since we should only
  inline the headers and the current code guards better against any
  hardware bugs

Changes from v1:
- Default to 32-bit DMA width (minimum supported) if hardware returns
  an unexpected DMA width value
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

b4c11cb4

amd-xgbe: Use napi_alloc_skb when allocating skb in softirq · 385565a1

Lendacky, Thomas authored Mar 20, 2015

Use the napi_alloc_skb function to allocate an skb when running within
the softirq context to avoid calls to local_irq_save/restore.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

385565a1

amd-xgbe: Fix Rx coalescing reporting · 4a57ebcc

Lendacky, Thomas authored Mar 20, 2015

The Rx coalescing value is internally converted from usecs to a value
that the hardware can use. When reporting the Rx coalescing value, this
internal value is converted back to usecs. During the conversion from
and back to usecs some rounding occurs. So, for example, when setting an
Rx usec of 30, it will be reported as 29. Fix this reporting issue by
keeping the original usec value and using that during reporting.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a57ebcc

amd-xgbe: Remove Tx coalescing · c635eaac

Lendacky, Thomas authored Mar 20, 2015

The Tx coalescing support in the driver was a software implementation
for something lacking in the hardware. Using hrtimers, the idea was to
trigger a timer interrupt after having queued a packet for transmit.
Unfortunately, as the timer value was lowered, the timer expired before
the hardware actually did the transmit and so it was racey and resulted
in unnecessary interrupts.

Remove the Tx coalescing support and hrtimer and replace with a Tx timer
that is used as a reclaim timer in case of inactivity.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c635eaac

amd-xgbe: Set DMA mask based on hardware register value · 386d325d

Lendacky, Thomas authored Mar 20, 2015

The hardware supplies a value that indicates the DMA range that it
is capable of using. Use this value rather than hard-coding it in
the driver.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

386d325d

amd-xgbe: Use the new DMA memory barriers where appropriate · ceb8f6be

Lendacky, Thomas authored Mar 20, 2015

Use the new lighter weight memory barriers when working with the device
descriptors.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ceb8f6be

amd-xgbe: Clarify output message about queues · 600c8811

Lendacky, Thomas authored Mar 20, 2015

Clarify that the queues referred to in a message when the device is
brought up are hardware queues and not necessarily related to the
Linux network queues.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

600c8811

amd-xgbe-phy: Provide support for auto-negotiation timeout · 9ae5eecd

Lendacky, Thomas authored Mar 20, 2015

Currently, there is no interrupt code that indicates auto-negotiation
has timed out. If the auto-negotiation has timed out then the start of
a new auto-negotiation will begin again with a new base page being
received. The state machine could be in a state that is not expecting
this interrupt code which results in an error during auto-negotiation.

Update the code to timestamp when the auto-negotiation starts. Should
another page received interrupt code occur before auto-negotiation has
completed but after the auto-negotiation timeout, then reset the state
machine to allow the auto-negotiation to continue.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9ae5eecd

amd-xgbe-phy: Use the phy_driver flags field · 65f57cb1

Lendacky, Thomas authored Mar 20, 2015

Remove the setting of the transceiver type when retrieving the device
settings using ethtool and instead set the transceiver type in the
phy_driver structure flags field. Change the transceiver type to be
internal, also.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

65f57cb1

amd-xgbe-phy: Use phydev advertising field vs supported · d9663c8c

Lendacky, Thomas authored Mar 20, 2015

With ethtool being able to control what is advertised, the advertising
field is what should be used for priming the auto-negotiation registers
and for various other checks, instead of the supported field.

Also, move the initial setting of the supported and advertising fields
into the probe function so that they are not reset each time the device
is brought up, thus allowing the user to set as desired before bringing
the device up.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d9663c8c

net: compat: Update get_compat_msghdr() to match copy_msghdr_from_user() behaviour · 91edd096

Catalin Marinas authored Mar 20, 2015

Commit db31c55a (net: clamp ->msg_namelen instead of returning an
error) introduced the clamping of msg_namelen when the unsigned value
was larger than sizeof(struct sockaddr_storage). This caused a
msg_namelen of -1 to be valid. The native code was subsequently fixed by
commit dbb490b9 (net: socket: error on a negative msg_namelen).

In addition, the native code sets msg_namelen to 0 when msg_name is
NULL. This was done in commit (6a2a2b3a net:socket: set msg_namelen
to 0 if msg_name is passed as NULL in msghdr struct from userland) and
subsequently updated by 08adb7da (fold verify_iovec() into
copy_msghdr_from_user()).

This patch brings the get_compat_msghdr() in line with
copy_msghdr_from_user().

Fixes: db31c55a (net: clamp ->msg_namelen instead of returning an error)
Cc: David S. Miller <davem@davemloft.net>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

91edd096

Merge branch 'rhashtable-inlined-interface' · ebd6af09

David S. Miller authored Mar 20, 2015

Herbert Xu says:

====================
rhashtable: Introduce inlined interface

This series of patches introduces the inlined rhashtable interface.

The idea is to make all the function pointers visible to the compiler
by providing the rhashtable_params structure explicitly to each
inline rhashtable function.  For example, instead of doing

	obj = rhashtable_lookup(ht, key);

you would now do

	obj = rhashtable_lookup_fast(ht, key, params);

Where params is the same data that you would give to rhashtable_init.
In particular, within rhashtable.c itself we would simply supply
ht->p.

So to convert users over, you simply have to make params globally
accessible, e.g., by placing it in a static const variable, which
can then be used at each inlined call site, as well as by the
rhashtable_init call.

The only ticky bit is that some users (i.e., netfilter) has a
dynamic key length.  This is dealt with by using params.key_len
in the inline functions when it is non-zero, and otherwise falling
back on ht->p.key_len.

Note that I've only tested this on one compiler, gcc 4.7.2.  So
please test this with your compilers as well and make sure that
the code is actually inlined without indirect function calls.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ebd6af09

rhashtable: Rip out obsolete out-of-line interface · dc0ee268

Herbert Xu authored Mar 20, 2015

Now that all rhashtable users have been converted over to the
inline interface, this patch removes the unused out-of-line
interface.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

dc0ee268

tipc: Use inlined rhashtable interface · 6cca7289

Herbert Xu authored Mar 20, 2015

This patch converts tipc to the inlined rhashtable interface.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

6cca7289

test_rhashtable: Use inlined rhashtable interface · b182aa6e

Herbert Xu authored Mar 20, 2015

This patch converts test_rhashtable to the inlined rhashtable
interface.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

b182aa6e

netfilter: Convert nft_hash to inlined rhashtable · fa377321

Herbert Xu authored Mar 20, 2015

This patch converts nft_hash to the inlined rhashtable interface.

This patch also replaces the call to rhashtable_lookup_compare with
a straight rhashtable_lookup_fast because it's simply doing a memcmp
(in fact nft_hash_lookup already uses memcmp instead of nft_data_cmp).

Furthermore, the compare function is only meant to compare, it is not
supposed to have side-effects.  The current side-effect code can
simply be moved into the nft_hash_get.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

fa377321

netlink: Move namespace into hash key · c428ecd1

Herbert Xu authored Mar 20, 2015

Currently the name space is a de facto key because it has to match
before we find an object in the hash table.  However, it isn't in
the hash value so all objects from different name spaces with the
same port ID hash to the same bucket.

This is bad as the number of name spaces is unbounded.

This patch fixes this by using the namespace when doing the hash.

Because the namespace field doesn't lie next to the portid field
in the netlink socket, this patch switches over to the rhashtable
interface without a fixed key.

This patch also uses the new inlined rhashtable interface where
possible.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

c428ecd1

rhashtable: Allow hash/comparison functions to be inlined · 02fd97c3

Herbert Xu authored Mar 20, 2015

This patch deals with the complaint that we make indirect function
calls on the fast paths unnecessarily in rhashtable.  We resolve
it by moving the fast paths into inline functions that take struct
rhashtable_param (which obviously must be the same set of parameters
supplied to rhashtable_init) as an argument.

The only remaining indirect call is to obj_hashfn (or key_hashfn it
obj_hashfn is unset) on the rehash as well as the insert-during-
rehash slow path.

This patch also extends the support of vairable-length keys to
include those where the key is fixed but scattered in the object.
For example, in netlink we want to key off the namespace and the
portid but they're not next to each other.

This patch does this by directly using the object hash function
as the indicator of whether the key is accessible or not.  It
also adds a new function obj_cmpfn to compare a key against an
object.  This means that the caller no longer needs to supply
explicit compare functions.

All this is done in a backwards compatible manner so no existing
users are affected until they convert to the new interface.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

02fd97c3

rhashtable: Make rhashtable_init params argument const · 488fb86e

Herbert Xu authored Mar 20, 2015

This patch marks the rhashtable_init params argument const as
there is no reason to modify it since we will always make a copy
of it in the rhashtable.

This patch also fixes a bug where we don't actually round up the
value of min_size unless it is less than HASH_MIN_SIZE.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

488fb86e

ebpf, filter: do not convert skb->protocol to host endianess during runtime · 0b8c707d

Daniel Borkmann authored Mar 19, 2015

Commit c2497395 ("bpf: allow BPF programs access 'protocol' and 'vlan_tci'
fields") has added support for accessing protocol, vlan_present and vlan_tci
into the skb offset map.

As referenced in the below discussion, accessing skb->protocol from an eBPF
program should be converted without handling endianess.

The reason for this is that an eBPF program could simply do a check more
naturally, by f.e. testing skb->protocol == htons(ETH_P_IP), where the LLVM
compiler resolves htons() against a constant automatically during compilation
time, as opposed to an otherwise needed run time conversion.

After all, the way of programming both from a user perspective differs quite
a lot, i.e. bpf_asm ["ld proto"] versus a C subset/LLVM.

Reference: https://patchwork.ozlabs.org/patch/450819/Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0b8c707d

ipv6: invert join/leave anycast rtnl/socket locking order · c4a6853d

Marcelo Ricardo Leitner authored Mar 20, 2015

Commit baf606d9 ("ipv4,ipv6: grab rtnl before locking the socket")
missed to update two setsockopt options, IPV6_JOIN_ANYCAST and
IPV6_LEAVE_ANYCAST, causing a lock inverstion regarding to the updated ones.

As ipv6_sock_ac_join and ipv6_sock_ac_leave are only called from
do_ipv6_setsockopt, we are good to just move the rtnl lock upper.

Fixes: baf606d9 ("ipv4,ipv6: grab rtnl before locking the socket")
Reported-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c4a6853d

vxlan: fix possible use of uninitialized in vxlan_igmp_{join, leave} · 149d7549

Marcelo Ricardo Leitner authored Mar 20, 2015

Test robot noticed that we check the return of vxlan_igmp_join and leave
but inside them there was a path that it could be used initialized.

It's not really possible because those if() inside these igmp functions
would always match as we can't have sockets of other type in there, but
this way we keep the compiler happy.

Fixes: 56ef9c90 ("vxlan: Move socket initialization to within rtnl scope")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

149d7549

Merge branch 'be2net' · de58a6da

David S. Miller authored Mar 20, 2015

Sathya Perla says:

====================
be2net: patch set

Hi David, this patch set includes 3 bug fixes to the be2net driver.

Patch 1 fixes a vlan isolation issue with VFs. When a VF is placed in
promiscous mode, it could receive packets belonging to any vlan, as
the PF driver grants vlan promisc capability to VFs. The PF
driver now disables the vlan promisc capability for VFs to fix this
problem.

Patch 2 fixes the call to MODIFY_EQ_DELAY FW cmd to not include more
than 8 EQs per cmd. The FW is not capable of handling more than 8 EQs
per cmd.

Patch 3 fixes an EEH error detection issue. On Power platforms,
when an EEH error occurs, the slot disconnect state is more reliably
detected via an MMIO read compared to a config read. So, the error
register reads that occur every second are now done via MMIO.

Pls apply this patch set to the "net" tree. Thanks!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

de58a6da

be2net: use PCI MMIO read instead of config read for errors · 25848c90

Suresh Reddy authored Mar 20, 2015

When an EEH error occurs, the device/slot is disconnected. This condition
is more reliably detected (i.e., returns all ones) with an MMIO read rather
than a config read -- especially on power platforms.

Hence, this patch fixes EEH error detection by replacing config reads with
MMIO reads for reading the error registers. The error registers in
Skyhawk-R/BE2/BE3 are accessible both via the config space and the
PCICFG (BAR0) memory space.
Reported-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Suresh Reddy <Suresh.Reddy@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

25848c90

be2net: restrict MODIFY_EQ_DELAY cmd to a max of 8 EQs · c8ba4ad0

Suresh Reddy authored Mar 20, 2015

Issuing this cmd for more than 8 EQs does not have the intended effect
even on BEx and Skyhawk-R.

This patch fixes this by issuing this cmd for upto 8 EQs at a time.
Signed-off-by: Suresh Reddy <Suresh.Reddy@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c8ba4ad0

be2net: Prevent VFs from enabling VLAN promiscuous mode · 435452aa

Vasundhara Volam authored Mar 20, 2015

Currently, a PF does not restrict its VF interface from enabling vlan
promiscuous mode. This breaks vlan isolation when a vlan
(transparent tagging) is configured on a VF.

This patch fixes this problem by disabling the vlan promisc capability
for VFs.
Reported-by: Yoann Juet <veilletechno-irts@univ-nantes.fr>
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

435452aa

tcp: fix tcp fin memory accounting · d22e1537

Josh Hunt authored Mar 19, 2015

tcp_send_fin() does not account for the memory it allocates properly, so
sk_forward_alloc can be negative in cases where we've sent a FIN:

ss example output (ss -amn | grep -B1 f4294):
tcp    FIN-WAIT-1 0      1            192.168.0.1:45520         192.0.2.1:8080
	skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0)
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d22e1537

ipv6: fix backtracking for throw routes · 73ba57bf

Steven Barth authored Mar 19, 2015

for throw routes to trigger evaluation of other policy rules
EAGAIN needs to be propagated up to fib_rules_lookup
similar to how its done for IPv4

A simple testcase for verification is:

ip -6 rule add lookup 33333 priority 33333
ip -6 route add throw 2001:db8::1
ip -6 route add 2001:db8::1 via fe80::1 dev wlan0 table 33333
ip route get 2001:db8::1
Signed-off-by: Steven Barth <cyrus@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

73ba57bf

net: ethernet: pcnet32: Setup the SRAM and NOUFLO on Am79C97{3, 5} · 87f966d9

Markos Chandras authored Mar 19, 2015

On a MIPS Malta board, tons of fifo underflow errors have been observed
when using u-boot as bootloader instead of YAMON. The reason for that
is that YAMON used to set the pcnet device to SRAM mode but u-boot does
not. As a result, the default Tx threshold (64 bytes) is now too small to
keep the fifo relatively used and it can result to Tx fifo underflow errors.
As a result of which, it's best to setup the SRAM on supported controllers
so we can always use the NOUFLO bit.

Cc: <netdev@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: <linux-kernel@vger.kernel.org>
Cc: Don Fry <pcnet32@frontier.com>
Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

87f966d9

ipv6: call ipv6_proxy_select_ident instead of ipv6_select_ident in udp6_ufo_fragment · 8e199dfd

Sabrina Dubroca authored Mar 19, 2015

Matt Grant reported frequent crashes in ipv6_select_ident when
udp6_ufo_fragment is called from openvswitch on a skb that doesn't
have a dst_entry set.

ipv6_proxy_select_ident generates the frag_id without using the dst
associated with the skb.  This approach was suggested by Vladislav
Yasevich.

Fixes: 0508c07f ("ipv6: Select fragment id during UFO segmentation if not set.")
Cc: Vladislav Yasevich <vyasevic@redhat.com>
Reported-by: Matt Grant <matt@mattgrant.net.nz>
Tested-by: Matt Grant <matt@mattgrant.net.nz>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8e199dfd

xen-netback: making the bandwidth limiter runtime settable · edafc132

Palik, Imre authored Mar 19, 2015

With the current netback, the bandwidth limiter's parameters are only
settable during vif setup time.  This patch register a watch on them, and
thus makes them runtime changeable.

When the watch fires, the timer is reset.  The timer's mutex is used for
fencing the change.

Cc: Anthony Liguori <aliguori@amazon.com>
Signed-off-by: Imre Palik <imrep@amazon.de>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

edafc132

Merge branch 'listener_refactor_part_14' · 750f2f91

David S. Miller authored Mar 20, 2015

Eric Dumazet says:

====================
inet: tcp listener refactoring part 14

OK, we have serious patches here.

We get rid of the central timer handling SYNACK rtx,
which is killing us under even medium SYN flood.

We still use the listener specific hash table.

This will be done in next round ;)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

750f2f91

net: increase sk_[max_]ack_backlog · becb74f0

Eric Dumazet authored Mar 19, 2015

sk_ack_backlog & sk_max_ack_backlog were 16bit fields, meaning
listen() backlog was limited to 65535.

It is time to increase the width to allow much bigger backlog,
if admins change /proc/sys/net/core/somaxconn &
/proc/sys/net/ipv4/tcp_max_syn_backlog default values.

Tested:

echo 5000000 >/proc/sys/net/core/somaxconn
echo 5000000 >/proc/sys/net/ipv4/tcp_max_syn_backlog

Ran a SYNFLOOD test against a listener using listen(fd, 5000000)

myhost~# grep request_sock_TCP /proc/slabinfo
request_sock_TCP  4185642 4411940    304   13    1 : tunables   54   27    8 : slabdata 339380 339380      0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

becb74f0

inet: get rid of central tcp/dccp listener timer · fa76ce73

Eric Dumazet authored Mar 19, 2015

One of the major issue for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.

This function runs for awful long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundred of ms.

SYNACK are sent in huge bursts, likely to cause severe drops anyway.

This model was OK 15 years ago when memory was very tight.

We now can afford to have a timer per request sock.

Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.

With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.

This is ~100 times more what we could achieve before this patch.

Later, we will get rid of the listener hash and use ehash instead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fa76ce73

inet: drop prev pointer handling in request sock · 52452c54

Eric Dumazet authored Mar 19, 2015

When request sock are put in ehash table, the whole notion
of having a previous request to update dl_next is pointless.

Also, following patch will get rid of big purge timer,
so we want to delete a request sock without holding listener lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

52452c54

rhashtable: Round up/down min/max_size to ensure we respect limit · a998f712

Thomas Graf authored Mar 19, 2015

Round up min_size respectively round down max_size to the next power
of two to make sure we always respect the limit specified by the
user. This is required because we compare the table size against the
limit before we expand or shrink.

Also fixes a minor bug where we modified min_size in the params
provided instead of the copy stored in struct rhashtable.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

a998f712