Commits · 946d3bd7231be3b6202759ea0bea59989ae28c4a · Kirill Smelkov / linux

12 Jun, 2013 4 commits

igmp: remove unnecessary in_device member zeroing · 946d3bd7

Shawn Bohrer authored Jun 07, 2013

ip_mc_init_dev() is passed a freshly kzalloc'd in_device so it is
unnecessary to explicitly zero out the members.
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


igmp: add a hash table to speed up ip_check_mc_rcu() · e9897071

Eric Dumazet authored Jun 07, 2013

After IP route cache removal, multicast applications using
a lot of multicast addresses hit O(N) behavior in ip_check_mc_rcu().

Add a per in_device hash table to get faster lookup.

This hash table is created only if the number of items in mc_list is
above 4.
Reported-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
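
The lookup strategy described in this commit can be sketched in user space as follows. This is a minimal illustration, not the kernel code: the struct names, the 8-bucket table size and the multiplicative hash are all assumptions made for the example. Only the rule "build the table once the list holds more than 4 entries" comes from the commit message.

```c
#include <stdint.h>
#include <stdlib.h>

#define MC_HASH_BITS      3   /* assumed table size: 8 buckets */
#define MC_HASH_SIZE      (1u << MC_HASH_BITS)
#define MC_HASH_THRESHOLD 4   /* build the table once the list exceeds this */

struct mc_entry {
	uint32_t addr;              /* multicast group address */
	struct mc_entry *next;      /* linear list of every group */
	struct mc_entry *next_hash; /* chain inside one hash bucket */
};

struct mc_dev {
	struct mc_entry *list;
	struct mc_entry **hash;     /* NULL while the list is short */
	unsigned int count;
};

static unsigned int mc_bucket(uint32_t addr)
{
	/* Multiplicative hash, keeping the top MC_HASH_BITS bits. */
	return (uint32_t)(addr * 2654435761u) >> (32 - MC_HASH_BITS);
}

static void mc_hash_insert(struct mc_dev *dev, struct mc_entry *e)
{
	unsigned int b = mc_bucket(e->addr);

	e->next_hash = dev->hash[b];
	dev->hash[b] = e;
}

/* Join a group. Returns 0 on success, -1 if allocation fails. */
int mc_add(struct mc_dev *dev, uint32_t addr)
{
	struct mc_entry *e = calloc(1, sizeof(*e));

	if (!e)
		return -1;
	e->addr = addr;
	e->next = dev->list;
	dev->list = e;
	dev->count++;

	if (dev->hash) {
		mc_hash_insert(dev, e);
	} else if (dev->count > MC_HASH_THRESHOLD) {
		/* First time over the threshold: hash every existing entry. */
		dev->hash = calloc(MC_HASH_SIZE, sizeof(*dev->hash));
		if (dev->hash)
			for (e = dev->list; e; e = e->next)
				mc_hash_insert(dev, e);
	}
	return 0;
}

/* Returns 1 if addr is joined on dev, else 0. */
int mc_check(const struct mc_dev *dev, uint32_t addr)
{
	const struct mc_entry *e;

	if (dev->hash) {
		for (e = dev->hash[mc_bucket(addr)]; e; e = e->next_hash)
			if (e->addr == addr)
				return 1;
		return 0;
	}
	for (e = dev->list; e; e = e->next)   /* short list: O(N) is fine */
		if (e->addr == addr)
			return 1;
	return 0;
}
```

If the table cannot be allocated, the sketch simply keeps using the linear list, so a lookup stays correct and only loses the speedup.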


net_sched: htb: do not setup default rate estimators · 64153ce0

Eric Dumazet authored Jun 06, 2013

With a thousand htb classes, est_timer() spends ~5 million cpu cycles
and throws out cpu cache, because each htb class has a default
rate estimator (est 4sec 16sec).

Most users do not use default rate estimators, so switch htb
to not set them up.

Add a module parameter (htb_rate_est) so that users relying
on this default rate estimator can revert the behavior.

echo 1 >/sys/module/sch_htb/parameters/htb_rate_est
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


net_sched: psched_ratecfg_precompute() improvements · 130d3d68

Eric Dumazet authored Jun 06, 2013

Before allowing 64-bit byte rates, refactor
psched_ratecfg_precompute() to get better comments
and increased accuracy.

rate_bps field is renamed to rate_bytes_ps, as we only
have to worry about bytes per second.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
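
The commit message does not spell out how the precomputation works. One common way to make a "bytes to transmit time" conversion both cheap and accurate is to replace the division by a multiply and a shift, choosing the largest multiplier that cannot overflow. The sketch below illustrates that idea in user space; the struct and function names are hypothetical and the loop is an assumption about the general technique, not a copy of the kernel function.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical user-space stand-in for the precomputed rate config. */
struct ratecfg_sketch {
	uint64_t rate_bytes_ps;  /* configured rate, bytes per second */
	uint32_t mult;           /* reciprocal multiplier */
	uint8_t  shift;          /* right shift applied after the multiply */
};

/*
 * Exact formula:   time_ns = len * NSEC_PER_SEC / rate_bytes_ps
 * Fast-path form:  time_ns = (len * mult) >> shift
 * Keep doubling the scale factor until mult uses bit 31, which gives
 * the most precision that still fits a 32-bit multiplier.
 */
void ratecfg_precompute(struct ratecfg_sketch *r, uint64_t rate_bytes_ps)
{
	uint64_t factor = NSEC_PER_SEC;

	r->rate_bytes_ps = rate_bytes_ps;
	r->mult = 1;
	r->shift = 0;
	if (rate_bytes_ps == 0)
		return;

	for (;;) {
		r->mult = (uint32_t)(factor / rate_bytes_ps);
		if ((r->mult & (1U << 31)) || (factor & (1ULL << 63)))
			break;
		factor <<= 1;
		r->shift++;
	}
}

/* Transmit time in nanoseconds for a packet of len bytes. */
uint64_t ratecfg_len_to_ns(const struct ratecfg_sketch *r, uint32_t len)
{
	return ((uint64_t)len * r->mult) >> r->shift;
}
```

For a rate of 125000000 bytes per second (1 Gbit), one byte takes exactly 8 ns, so a 1500 byte packet converts to 12000 ns.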


11 Jun, 2013 15 commits

net_sched: add 64bit rate estimators · 45203a3b

Eric Dumazet authored Jun 06, 2013

struct gnet_stats_rate_est contains u32 fields, so the bytes per second
field can wrap at 34360Mbit.

Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields,
and switch the kernel to use this structure natively.

This structure is dumped to user space as a new attribute:

TCA_STATS_RATE_EST64

Old tc command will now display the capped bps (to 34360Mbit), instead
of wrapped values, and updated tc command will display correct
information.

Old tc command output, after patch:

eric:~# tc -s -d qd sh dev lo
qdisc pfifo 8001: root refcnt 2 limit 1000p
 Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0)
 rate 34360Mbit 189696pps backlog 0b 0p requeues 0

This patch carefully reorganizes "struct Qdisc" layout to get optimal
performance on SMP.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
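
The 34360Mbit figure follows from the u32 field: 2^32 - 1 bytes per second times 8 bits is about 34360 Mbit per second. A small sketch of the difference between wrapping and capping, using hypothetical helper names that are not part of the kernel API:

```c
#include <stdint.h>

/* Clamp a 64-bit bytes-per-second value into a legacy u32 field. */
uint32_t rate_est_cap32(uint64_t bytes_ps)
{
	return bytes_ps > UINT32_MAX ? UINT32_MAX : (uint32_t)bytes_ps;
}

/* What plain truncation to 32 bits would report instead. */
uint32_t rate_est_wrap32(uint64_t bytes_ps)
{
	return (uint32_t)bytes_ps;
}

/* Bytes per second to Mbit per second, rounded to nearest. */
uint64_t bytes_ps_to_mbit(uint64_t bytes_ps)
{
	return (bytes_ps * 8 + 500000) / 1000000;
}
```

With a true rate of 5000000000 bytes per second, the capped value reads as 34360 Mbit, while the wrapped value would read as 705032704 bytes per second, which is far below the real rate.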


net: pass correct parameter to skb_headers_offset_update() · b41abb42

Peter Pan(潘卫平) authored Jun 06, 2013

Since commit 1a37e412 (net: Use 16bits for *_headers fields of struct
skbuff), the skb->*_header fields are offsets relative to skb->head,
so copy_skb_header() should no longer call skb_headers_offset_update(),
and we should pass the correct parameter to skb_headers_offset_update()
in pskb_expand_head() and skb_copy_expand().
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
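
To see why the offset matters, consider header positions stored as offsets from the start of the buffer. They remain valid across a plain copy, and only need adjusting when the headroom changes, by exactly the headroom difference. The sketch below uses a hypothetical cut-down struct, not struct sk_buff itself.

```c
#include <stdint.h>

/* Hypothetical cut-down buffer descriptor: headers are offsets from head. */
struct skb_sketch {
	unsigned char *head;        /* start of the allocated buffer */
	unsigned char *data;        /* start of packet data (head + headroom) */
	uint16_t transport_header;
	uint16_t network_header;
	uint16_t mac_header;
};

/* Shift every header offset by off bytes (off may be negative). */
void headers_offset_update(struct skb_sketch *skb, int off)
{
	skb->transport_header = (uint16_t)(skb->transport_header + off);
	skb->network_header   = (uint16_t)(skb->network_header + off);
	skb->mac_header       = (uint16_t)(skb->mac_header + off);
}

/* Resolve the network header offset to a pointer. */
unsigned char *network_header_ptr(const struct skb_sketch *skb)
{
	return skb->head + skb->network_header;
}
```

When a packet is moved into a buffer whose headroom grew from 16 to 64 bytes, the correct argument is the difference, 48, so that each header keeps the same distance from data.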


netlink: Add compare function for netlink_table · da12c90e

Gao feng authored Jun 06, 2013

As we know, netlink sockets are a private resource of a
net namespace; they can communicate with each other
only when they are in the same net namespace. This works
well until we try to add namespace support for other
subsystems which use netlink.

Unlike ipv4 and the route table, some of these subsystems
are not suited to belonging to a net namespace. The audit
and crypto subsystems, for example, are more suitable to
a user namespace.

So we must have the ability to let netlink sockets in the
same user namespace communicate with each other.

This patch adds a new function pointer "compare" for
netlink_table, we can decide if the netlink sockets can
communicate with each other through this netlink_table
self-defined compare function.

The behavior isn't changed if we don't provide the compare
function for netlink_table.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
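
The idea of an optional per-table compare hook with an unchanged default can be sketched as below. All types and names here are hypothetical simplifications for illustration; in particular the callback signature is an assumption and does not reproduce the kernel's.

```c
#include <stdbool.h>
#include <stddef.h>

struct net_ns  { int id; };
struct user_ns { int id; };

/* Hypothetical socket: knows which namespaces it lives in. */
struct nl_sock_sketch {
	const struct net_ns  *net;
	const struct user_ns *user;
};

/* Hypothetical per-protocol table entry with an optional hook. */
struct nl_table_sketch {
	/* NULL keeps the historical "same net namespace" rule. */
	bool (*compare)(const struct nl_sock_sketch *a,
			const struct nl_sock_sketch *b);
};

static bool nl_default_compare(const struct nl_sock_sketch *a,
			       const struct nl_sock_sketch *b)
{
	return a->net == b->net;
}

/* Example hook for a subsystem that is scoped by user namespace. */
bool nl_compare_user_ns(const struct nl_sock_sketch *a,
			const struct nl_sock_sketch *b)
{
	return a->user == b->user;
}

/* May the two sockets communicate through this table? */
bool nl_can_talk(const struct nl_table_sketch *t,
		 const struct nl_sock_sketch *a,
		 const struct nl_sock_sketch *b)
{
	if (t->compare)
		return t->compare(a, b);
	return nl_default_compare(a, b);
}
```

A table that leaves the hook unset behaves exactly as before, which matches the "behavior isn't changed" guarantee in the commit message.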


xen-netfront: use skb_partial_csum_set() to simplify the codes · 8249152c

Li RongQing authored Jun 06, 2013

Use skb_partial_csum_set() to simplify the code.

Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Merge branch 'bridge_flags' · 2e422069

David S. Miller authored Jun 11, 2013

Vlad Yasevich says:

====================
The following series adds 2 new flags to bridge.  One flag allows
the user to control whether mac learning is performed on the interface
or not.  By default mac learning is on.
The other flag allows the user to control whether unicast traffic
is flooded (sent without an fdb entry) to a given port.  Default is
on.

Changes since v4:
 - Implemented Stephen's suggestions.

Changes since v2:
 - removed unused "unlock" tag.

Changes since v1:
 - Integrated suggestion from MST to not impact RTM_NEWNEIGH and to
   skip lookups when learning is disabled.

Vlad Yasevich (2):
  bridge: Add flag to control mac learning.
  bridge: Add a flag to control unicast packet flood.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>


bridge: Add a flag to control unicast packet flood. · 867a5943

Vlad Yasevich authored Jun 05, 2013

Add a flag to control flood of unicast traffic.  By default, flood is
on and the bridge will flood unicast traffic if it doesn't know
the destination.  When the flag is turned off, unicast traffic
without an FDB will not be forwarded to the specified port.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


bridge: Add flag to control mac learning. · 9ba18891

Vlad Yasevich authored Jun 05, 2013

Allow user to control whether mac learning is enabled on the port.
By default, mac learning is enabled.  Disabling mac learning will
cause new dynamic FDB entries to not be created for a particular port.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
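
The effect of the two per-port flags in this series (mac learning, and unicast flood from the preceding commit) can be modelled with a toy bridge. The flag names, the fixed-size FDB and the four-port layout are assumptions for the example and are not the bridge code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PORT_LEARNING 0x1u   /* hypothetical flag: learn source MACs */
#define PORT_FLOOD    0x2u   /* hypothetical flag: flood unknown unicast */
#define FDB_MAX       16

struct fdb_entry { uint8_t mac[6]; int port; };

struct bridge_sketch {
	unsigned int port_flags[4];
	struct fdb_entry fdb[FDB_MAX];
	int fdb_count;
};

/* Returns the port that owns mac, or -1 if there is no FDB entry. */
int fdb_lookup(const struct bridge_sketch *br, const uint8_t mac[6])
{
	for (int i = 0; i < br->fdb_count; i++)
		if (memcmp(br->fdb[i].mac, mac, 6) == 0)
			return br->fdb[i].port;
	return -1;
}

/* Ingress: create a dynamic entry only if the port has learning on. */
void bridge_learn(struct bridge_sketch *br, int port, const uint8_t src[6])
{
	if (!(br->port_flags[port] & PORT_LEARNING))
		return;
	if (fdb_lookup(br, src) >= 0 || br->fdb_count == FDB_MAX)
		return;
	memcpy(br->fdb[br->fdb_count].mac, src, 6);
	br->fdb[br->fdb_count].port = port;
	br->fdb_count++;
}

/* Egress: should a unicast frame for dst be sent out of port? */
bool bridge_should_forward(const struct bridge_sketch *br, int port,
			   const uint8_t dst[6])
{
	int known = fdb_lookup(br, dst);

	if (known >= 0)
		return known == port;                  /* known destination */
	return (br->port_flags[port] & PORT_FLOOD) != 0; /* unknown: flood? */
}
```

With learning off on a port, frames arriving there never populate the FDB; with flood off, the port only receives unicast frames whose destination is already known to live behind it.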


net: remove last caller of skb_tail_offset() and itself · 30f3a40f

Cong Wang authored Jun 05, 2013

Similar to the following commits:

commit 00f97da1 (netpoll: fix position of network header)
commit 525cebed (pktgen: Fix position of ip and udp header)

Using skb_tail_offset() does not seem correct, since the offset
is based on the head pointer.

With the last caller removed, skb_tail_offset() can finally
be killed.

Cc: Thomas Graf <tgraf@suug.ch>
Cc: Daniel Borkmann <dborkmann@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Merge branch 'll_poll' · 0a4db187

David S. Miller authored Jun 10, 2013

Eliezer Tamir says:

====================
This patch set adds the ability for the socket layer code to
poll directly on an Ethernet device's RX queue.
This eliminates the cost of the interrupt and context switch
and with proper tuning allows us to get very close to the HW latency.

This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from
last year
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf

Patch 1 adds a napi_id and a hashing mechanism to lookup a napi by id.
Patch 2 adds an ndo_ll_poll method and the code that supports it.
Patch 3 adds support for busy-polling on UDP sockets.
Patch 4 adds support for TCP.
Patch 5 adds the ixgbe driver code implementing ndo_ll_poll.
Patch 6 adds additional statistics to the ixgbe driver for ndo_ll_poll.

Performance numbers:
setup                                 TCP_RR              UDP_RR
kernel   Config     C3/6  rx-usecs    tps  cpu%  S.dem    tps  cpu%  S.dem
patched  optimized  on    100         87k  3.13  11.4     94K  3.17  10.7
patched  optimized  on    0           71k  3.12  14.0     84k  3.19  12.0
patched  optimized  on    adaptive    80k  3.13  12.5     90k  3.46  12.2
patched  typical    on    100         72   3.13  14.0     79k  3.17  12.8
patched  typical    on    0           60k  2.13  16.5     71k  3.18  14.0
patched  typical    on    adaptive    67k  3.51  16.7     75k  3.36  14.5
3.9      optimized  on    adaptive    25k  1.0   12.7     28k  0.98  11.2
3.9      typical    off   0           48k  1.09  7.3      52k  1.11  4.18
3.9      typical    off   adaptive    35k  1.12  4.08     38k  0.65  5.49
3.9      optimized  off   adaptive    40k  0.82  4.83     43k  0.70  5.23
3.9      optimized  off   0           57k  1.17  4.08     62k  1.04  3.95

Test setup details:
Machines: each with two Intel Xeon 2680 CPUs and X520 (82599) optical
NICs
Tests: Netperf tcp_rr and udp_rr, 1 byte (round trips per second)
Kernel: unmodified 3.9 and patched 3.9
Config: typical is derived from RH6.2, optimized is a stripped down
config.
Interrupt coalescing (ethtool rx-usecs) settings: 0=off, 1=adaptive,
100 us
When C3/6 states were turned on (via BIOS) the performance governor
was used.

These performance numbers were measured with v2 of the patch set.
Performance of the optimized config with an rx-usecs setting of 100
(the first line in the table above) was tracked during the evolution
of the patches and has never varied by more than 1%.

Design:
A global hash table that allows us to look up a struct napi by a
unique id was added.

A napi_id field was added both to struct sk_buff and struct sock.
This is used to track which NAPI we need to poll for a specific
socket.

The device driver marks every incoming skb with this id.
This is propagated to the sk when the socket is looked up in the
protocol handler.

When the socket code does not find any more data on the socket queue,
it now may call ndo_ll_poll which will crank the device's rx queue and
feed incoming packets to the stack directly from the context of the
socket.

A sysctl value (net.core.low_latency_poll) controls how many
microseconds we busy-wait before giving up. (Setting it to 0 globally
disables busy-polling.)
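
The design above can be summarised as: mark the socket with the napi id of the packets it receives, then, when a read finds the queue empty, poll the device for a bounded number of microseconds. The following user-space sketch models that flow with a fake device and a fake clock; every type and function name in it is hypothetical and only the overall flow is taken from the text.

```c
#include <stdbool.h>
#include <stdint.h>

struct skb_ll_sketch { unsigned int napi_id; };

struct sock_ll_sketch {
	unsigned int napi_id;  /* 0: no NAPI context recorded yet */
	int queued;            /* packets waiting on the socket queue */
};

/* Fake device and clock so the sketch can run in user space. */
struct fake_dev {
	uint64_t clock_us;     /* advances by 1 us per poll */
	uint64_t arrival_us;   /* time at which a packet shows up */
};

/* Stand-in for the driver's poll method: returns packets delivered. */
int fake_ll_poll(struct fake_dev *dev, unsigned int napi_id)
{
	(void)napi_id;         /* a real driver would pick the queue by id */
	dev->clock_us++;
	return dev->clock_us >= dev->arrival_us ? 1 : 0;
}

/* Receive path: remember which NAPI context fed this socket. */
void sk_mark_ll_sketch(struct sock_ll_sketch *sk,
		       const struct skb_ll_sketch *skb)
{
	sk->napi_id = skb->napi_id;
}

/*
 * Read path: busy-poll for at most budget_us microseconds.
 * A budget of 0 disables polling. Returns true if data is available.
 */
bool sk_poll_ll_sketch(struct sock_ll_sketch *sk, struct fake_dev *dev,
		       unsigned int budget_us)
{
	uint64_t end;

	if (budget_us == 0 || sk->napi_id == 0)
		return sk->queued > 0;

	end = dev->clock_us + budget_us;
	while (sk->queued == 0 && dev->clock_us < end)
		sk->queued += fake_ll_poll(dev, sk->napi_id);
	return sk->queued > 0;
}
```

When the budget expires without data, the caller would fall back to the normal blocking or non-blocking path, which is outside the scope of this sketch.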

Locking:

1. Locking between napi poll and ndo_ll_poll:
Since what needs to be locked between a device's NAPI poll and
ndo_ll_poll, is highly device / configuration dependent, we do this
inside the Ethernet driver.
For example, when packets for high priority connections are sent to
separate rx queues, you might not need locking between napi poll and
ndo_ll_poll at all.

For ixgbe we only lock the RX queue.
ndo_ll_poll does not touch the interrupt state or the TX queues.
(earlier versions of this patchset did touch them,
but this design is simpler and works better.)

If a queue is actively polled by a socket (on another CPU) napi poll
will not service it, but will wait until the queue can be locked
and cleaned before doing a napi_complete().
If a socket can't lock the queue because another CPU has it,
either from napi or from another socket polling on the queue,
the socket code can busy wait on the socket's skb queue.

ndo_ll_poll does not have preferential treatment for the data from the
calling socket vs. data from others, so if another CPU is polling,
you will see your data on this socket's queue when it arrives.

ndo_ll_poll is called with local BHs disabled, so it won't race on
the same CPU with net_rx_action, which calls the napi poll method.
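
The ownership rule between napi poll and socket polling can be modelled with a try-lock on the RX queue. The sketch below uses C11 atomics and hypothetical names; it is a simplified model of the behaviour described above, not the ixgbe locking code, and it ignores interrupt state and the TX queues just as the text says the driver does.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum rxq_owner { RXQ_IDLE = 0, RXQ_NAPI, RXQ_POLL };

struct rxq_sketch {
	atomic_int owner;   /* who currently holds the RX queue */
	int cleaned;        /* packets processed so far */
};

/* Take the queue only if nobody else holds it. */
static bool rxq_trylock(struct rxq_sketch *q, enum rxq_owner who)
{
	int expected = RXQ_IDLE;

	return atomic_compare_exchange_strong(&q->owner, &expected, (int)who);
}

static void rxq_unlock(struct rxq_sketch *q)
{
	atomic_store(&q->owner, RXQ_IDLE);
}

/* NAPI poll: returns true when it would be safe to complete NAPI. */
bool napi_poll_sketch(struct rxq_sketch *q)
{
	if (!rxq_trylock(q, RXQ_NAPI))
		return false;   /* a socket is polling: stay scheduled, retry */
	q->cleaned++;
	rxq_unlock(q);
	return true;
}

/* Socket poll: returns packets processed, or -1 if the queue is busy. */
int ll_poll_sketch(struct rxq_sketch *q)
{
	if (!rxq_trylock(q, RXQ_POLL))
		return -1;      /* caller keeps watching its own skb queue */
	q->cleaned++;
	rxq_unlock(q);
	return 1;
}
```

Neither side ever blocks on the other: the loser of the race simply reports that it did no work and tries again later.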

2. Napi_hash
The napi hash mechanism uses RCU.
napi_by_id() must be called under rcu_read_lock().
After a call to napi_hash_del(), caller must take care to wait an rcu
grace period before freeing the memory containing the napi struct.
(Ixgbe already had this because the queue vector structure uses rcu to
protect the statistics counters in it.)
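
A minimal single-threaded model of the id hash is shown below, including the collision avoidance mentioned in the v7 change log. The names, the bucket count and the id generator are assumptions for the example. RCU and the spinlock are deliberately left out, so the comments mark where the real code would need them.

```c
#include <stddef.h>

#define NAPI_HASH_SIZE 16   /* assumed bucket count */

struct napi_sketch {
	unsigned int napi_id;        /* 0 = not hashed */
	struct napi_sketch *next;
};

struct napi_table_sketch {
	struct napi_sketch *bucket[NAPI_HASH_SIZE];
	unsigned int gen_id;         /* last id handed out */
};

/* Lookup by id. The kernel version must run under rcu_read_lock(). */
struct napi_sketch *napi_by_id_sketch(const struct napi_table_sketch *t,
				      unsigned int id)
{
	struct napi_sketch *n;

	for (n = t->bucket[id % NAPI_HASH_SIZE]; n; n = n->next)
		if (n->napi_id == id)
			return n;
	return NULL;
}

/* Assign a fresh id, skipping 0 and any id that is still in use. */
void napi_hash_add_sketch(struct napi_table_sketch *t, struct napi_sketch *n)
{
	unsigned int b;

	if (n->napi_id)
		return;                      /* already hashed */
	do {
		if (++t->gen_id == 0)
			t->gen_id = 1;
	} while (napi_by_id_sketch(t, t->gen_id));

	n->napi_id = t->gen_id;
	b = n->napi_id % NAPI_HASH_SIZE;
	n->next = t->bucket[b];
	t->bucket[b] = n;
}

/* Unlink. The kernel caller must wait an RCU grace period before freeing. */
void napi_hash_del_sketch(struct napi_table_sketch *t, struct napi_sketch *n)
{
	struct napi_sketch **pp;

	if (!n->napi_id)
		return;
	for (pp = &t->bucket[n->napi_id % NAPI_HASH_SIZE]; *pp; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			break;
		}
	}
	n->napi_id = 0;
	n->next = NULL;
}
```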

How to test:

1. The patchset should apply cleanly to net-next.
(don't forget to configure INET_LL_RX_POLL).

2. The ethtool -c setting for rx-usecs should be on the order of 100.

3. Use ethtool -K to disable GRO and LRO.
(You are encouraged to try it both ways. If you find that your
workload does better with GRO on, do tell us.)

4. Sysctl value net.core.low_latency_poll controls how long
(in us) to busy-wait for more data. You are encouraged to play
with this and see what works for you. The default is now 0, so you need
to set it to turn the feature on. I recommend a value around 50.

5. The benchmark thread and IRQ should be bound to separate cores.
Both cores should be on the same CPU NUMA node as the NIC.
When the app and the IRQ run on the same CPU you get a small penalty.
If interrupt coalescing is set to a low value this penalty can be very
large.

6. If you suspect that your machine is not configured properly,
use numademo to make sure that the CPU to memory BW is OK.
numademo 128m memcpy local copy numbers should be more than
8GB/s on a properly configured machine.

Change log:
v10
- removed select/poll support. (we will work on this some more and try again)
v9
- correct sysctl proc_handler, reported by Eric Dumazet and Amir Vadai.
- more int -> bool changes, reported by Eric Dumazet.
- better mask testing in sock_poll(), reported by Eric Dumazet.

v8
- split out udp and select/poll into separate patches.
what used to be patch 2/5 is now three patches.
- type corrections from Amir Vadai and Cong Wang:
one unsigned long that was left when changing to cycles_t
int -> bool
- more detailed patch descriptions.

v7
- suggested by Ben Hutchings and Eric Dumazet:
type fixes, static for globals in net/core.c,
avoid napi_id collisions in napi_hash_add()

v6
- many small fixes suggested by Eric Dumazet:
data locality, typos, documentation
protect napi_hash insert/delete with a spinlock (napi_gen_id is no
longer atomic_t since it's only accessed with the spinlock held.)
- added IPv6 TCP and UDP support (only minimally tested)

v5
- corrections suggested by Ben Hutchings:
fixed typos, moved the config option and sysctl value from IPv4 to net
- moved sk_mark_ll() to the protocol handlers
- removed global id mechanism, replaced with a hashed napi_id.
based on code sample from Eric Dumazet
Note that ixgbe_free_q_vector() already waits an rcu grace period
before freeing the q_vector, so nothing additional needs to be done
when adding a call to napi_hash_del().
- simple poll/select support

v4
- removed separate config option for TCP as suggested by Eric Dumazet.
- added linux mib counter for packets received through the low latency path,
as suggested by Andi Kleen.
- re-allow module unloading, remove module param, use a global generation id
instead to prevent the use of a stale napi pointer, as suggested
by Eric Dumazet
- updated Documentation/networking/ip-sysctl.txt text

v3
- coding style changes suggested by Dave Miller

v2
- the sysctl knob is now in microseconds. The default value is now 0 (off).
- for now the code depends at configure time on CONFIG_X86_TSC
- the napi reference in struct skb is now a union with the dma cookie
since the former is only used on RX and the latter on TX,
as suggested by Eric Dumazet.
- we do a better job at honoring non-blocking operations.
- removed busy-polling support for tcp_read_sock()
- remove dynamic disabling of GRO
- coding style fixes
- disallow unloading the device module after the feature has been used

Credit:
Jesse Brandeburg, Arun Chekhov Ilango, Julie Cummings,
Alexander Duyck, Eric Geisler, Jason Neighbors, Yadong Li,
Mike Polehn, Anil Vasudevan, Don Wood
Special thanks for finding bugs in earlier versions:
Willem de Bruijn and Andi Kleen
====================
Signed-off-by: David S. Miller <davem@davemloft.net>


ixgbe: add extra stats for ndo_ll_poll · 7e15b90f

Eliezer Tamir authored Jun 10, 2013

Add additional statistics to the ixgbe driver for ndo_ll_poll.
They are defined under LL_EXTENDED_STATS.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


ixgbe: add support for ndo_ll_poll · 5a85e737

Eliezer Tamir authored Jun 10, 2013

Add the ixgbe driver code implementing ndo_ll_poll.
Adds ndo_ll_poll method and locking between it and the napi poll.
When receiving a packet we use skb_mark_ll to record the napi it came from.
Add each napi to the napi_hash right after netif_napi_add().
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


tcp: add low latency socket poll support. · d30e383b

Eliezer Tamir authored Jun 10, 2013

Adds low latency socket poll support for TCP.
In tcp_v[46]_rcv() add a call to sk_mark_ll() to copy the napi_id
from the skb to the sk.
In tcp_recvmsg(), when there is no data in the socket we busy-poll.
This is a good example of how to add busy-poll support to more protocols.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


udp: add low latency socket poll support · a5b50476

Eliezer Tamir authored Jun 10, 2013

Add support for busy-polling on UDP sockets.
In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id
from the skb into the sk.
This is done at the earliest possible moment, right after we identify
which socket this skb is for.
In __skb_recv_datagram(), when there is no data and the user
tries to read, we busy-poll.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


net: add low latency socket poll · 06021292

Eliezer Tamir authored Jun 10, 2013

Adds an ndo_ll_poll method and the code that supports it.
This method can be used by low latency applications to busy-poll
Ethernet device queues directly from the socket code.
sysctl_net_ll_poll controls how many microseconds to poll.
Default is zero (disabled).
Individual protocol support will be added by subsequent patches.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


net: add napi_id and hash · af12fa6e

Eliezer Tamir authored Jun 10, 2013

Adds a napi_id and a hashing mechanism to lookup a napi by id.
This will be used by subsequent patches to implement low latency
Ethernet device polling.
Based on a code sample by Eric Dumazet.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


10 Jun, 2013 3 commits

bcm63xx_enet: add support for Broadcom BCM63xx integrated gigabit switch · 6f00a022

Maxime Bizon authored Jun 04, 2013

Newer Broadcom BCM63xx SoCs: 6328, 6362 and 6368 have an integrated switch
which needs to be driven slightly differently from the traditional
external switches. This patch introduces changes in arch/mips/bcm63xx in order
to:

- register a bcm63xx_enetsw driver instead of bcm63xx_enet driver
- update DMA channels configuration & state RAM base addresses
- add a new platform data configuration knob to define the number of
  ports per switch/device and force link on some ports
- define the required switch registers

On the driver side, the following changes are required:

- the switch ports need to be polled to ensure the link is up and
  running and RX/TX can properly work
- basic switch configuration needs to be performed for the switch to
  forward packets to the CPU
- update the MIB counters since the integrated
Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Jonas Gorski <jogo@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


bcm63xx_enet: split DMA channel register accesses · 0ae99b5f

Maxime Bizon authored Jun 04, 2013

The current bcm63xx_enet driver always uses bcmenet_shared_base whenever
it needs to access DMA channel configuration space or access the DMA
channel state RAM. Split these registers into 3 parts to be more accurate:

- global DMA configuration
- per DMA channel configuration space
- per DMA channel state RAM space

This is preliminary to support new chips where the global DMA
configuration remains the same, but there is a varying number of DMA
channels located at a different memory offset.
Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Jonas Gorski <jogo@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
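
The three-way split can be pictured as three base addresses plus a per-channel stride, so that a chip with a different channel count or channel offset only changes the bases and the stride. The sketch below is illustrative only: the struct, the accessor names and the stride field are assumptions, and plain arrays stand in for the memory-mapped registers.

```c
#include <stdint.h>

/* Hypothetical description of the three DMA register areas. */
struct enet_dma_regs_sketch {
	volatile uint32_t *dma;    /* global DMA configuration */
	volatile uint32_t *dmac;   /* per-channel configuration space */
	volatile uint32_t *dmas;   /* per-channel state RAM */
	unsigned int chan_width;   /* bytes of register space per channel */
};

/* Global register: byte offset off from the global base. */
void dma_writel(const struct enet_dma_regs_sketch *r, uint32_t val,
		unsigned int off)
{
	r->dma[off / 4] = val;
}

/* Per-channel configuration register for channel chan. */
void dmac_writel(const struct enet_dma_regs_sketch *r, uint32_t val,
		 unsigned int off, unsigned int chan)
{
	r->dmac[(off + chan * r->chan_width) / 4] = val;
}

/* Per-channel state RAM word for channel chan. */
void dmas_writel(const struct enet_dma_regs_sketch *r, uint32_t val,
		 unsigned int off, unsigned int chan)
{
	r->dmas[(off + chan * r->chan_width) / 4] = val;
}
```

Callers name the area and the channel instead of computing an absolute offset from a single shared base.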


bcm63xx_enet: implement reset autoneg ethtool callback · 7260aac9

Maxime Bizon authored Jun 04, 2013

Implement the nway_reset ethtool callback, which uses the libphy generic
autonegotiation restart function.
Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>


08 Jun, 2013 18 commits