Commits · 62fc8051083a334578c3f4b3488808f210b4565f · Kirill Smelkov / linux

11 May, 2010 3 commits

netfilter: xtables: deconstify struct xt_action_param for matches · 62fc8051

Jan Engelhardt authored Jul 07, 2009

In future, layer-3 matches will be an xt module of their own, and
need to set the fragoff and thoff fields. Adding more pointers would
needlessy increase memory requirements (esp. so for 64-bit, where
pointers are wider).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>

62fc8051

netfilter: xtables: substitute temporary defines by final name · 4b560b44
Jan Engelhardt authored Jul 05, 2009
```
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
```
4b560b44

netfilter: xtables: combine struct xt_match_param and xt_target_param · de74c169

Jan Engelhardt authored Jul 05, 2009

The structures carried - besides match/target - almost the same data.
It is possible to combine them, as extensions are evaluated serially,
and so, the callers end up a little smaller.

  text  data  bss  filename
-15318   740  104  net/ipv4/netfilter/ip_tables.o
+15286   740  104  net/ipv4/netfilter/ip_tables.o
-15333   540  152  net/ipv6/netfilter/ip6_tables.o
+15269   540  152  net/ipv6/netfilter/ip6_tables.o
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>

de74c169

02 May, 2010 3 commits
- netfilter: xtables: dissolve do_match function · ef53d702
  Jan Engelhardt authored Jul 09, 2009
```
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
```
  ef53d702
- netfilter: xtables: fix incorrect return code · c29c9492
  Jan Engelhardt authored May 02, 2010
```
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
```
  c29c9492
- netfilter: ip_tables: fix compilation when debug is enabled · b5cad0df
  Jan Engelhardt authored May 02, 2010
```
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
```
  b5cad0df
27 Apr, 2010 1 commit

netfilter: x_tables: rectify XT_FUNCTION_MAXNAMELEN usage · 4b2cbd42

Jan Engelhardt authored Apr 27, 2010

There has been quite a confusion in userspace about
XT_FUNCTION_MAXNAMELEN; because struct xt_entry_match used MAX-1,
userspace would have to do an awkward MAX-2 for maximum length
checking (due to '\0'). This patch adds a new define that matches the
definition of XT_TABLE_MAXNAMELEN - being the size of the actual
struct member, not one off.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>

4b2cbd42

23 Apr, 2010 1 commit

netfilter: nf_conntrack: extend with extra stat counter · af740b2c

Jesper Dangaard Brouer authored Apr 23, 2010

I suspect an unfortunatly series of events occuring under a DDoS
attack, in function __nf_conntrack_find() nf_contrack_core.c.

Adding a stats counter to see if the search is restarted too often.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Patrick McHardy <kaber@trash.net>

af740b2c

22 Apr, 2010 1 commit

netfilter: ip_tables: convert pr_devel() to pr_debug() · cecc74de

Patrick McHardy authored Apr 22, 2010

We want to be able to use CONFIG_DYNAMIC_DEBUG in netfilter code, switch
the few existing pr_devel() calls to pr_debug().
Signed-off-by: Patrick McHardy <kaber@trash.net>

cecc74de

21 Apr, 2010 1 commit

netfilter: x_tables: move sleeping allocation outside BH-disabled region · d97a9e47

Jan Engelhardt authored Apr 21, 2010

The jumpstack allocation needs to be moved out of the critical region.
Corrects this notice:

BUG: sleeping function called from invalid context at mm/slub.c:1705
[  428.295762] in_atomic(): 1, irqs_disabled(): 0, pid: 9111, name: iptables
[  428.295771] Pid: 9111, comm: iptables Not tainted 2.6.34-rc1 #2
[  428.295776] Call Trace:
[  428.295791]  [<c012138e>] __might_sleep+0xe5/0xed
[  428.295801]  [<c019e8ca>] __kmalloc+0x92/0xfc
[  428.295825]  [<f865b3bb>] ? xt_jumpstack_alloc+0x36/0xff [x_tables]
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>

d97a9e47

20 Apr, 2010 6 commits

netfilter: bridge-netfilter: fix refragmenting IP traffic encapsulated in PPPoE traffic · 6c79bf0f

Bart De Schuymer authored Apr 20, 2010

The MTU for IP traffic encapsulated inside PPPoE traffic is smaller
than the MTU of the Ethernet device (1500). Connection tracking
gathers all IP packets and sometimes will refragment them in
ip_fragment(). We then need to subtract the length of the
encapsulating header from the mtu used in ip_fragment(). The check in
br_nf_dev_queue_xmit() which determines if ip_fragment() has to be
called is also updated for the PPPoE-encapsulated packets.
nf_bridge_copy_header() is also updated to make sure the PPPoE data
length field has the correct value.
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Patrick McHardy <kaber@trash.net>

6c79bf0f

Merge branch 'master' of /repos/git/net-next-2.6 · 62910554

Patrick McHardy authored Apr 20, 2010

Conflicts:
	Documentation/feature-removal-schedule.txt
	net/ipv6/netfilter/ip6t_REJECT.c
	net/netfilter/xt_limit.c
Signed-off-by: Patrick McHardy <kaber@trash.net>

62910554

netfilter: xt_TEE: resolve oif using netdevice notifiers · 22265a5c

Patrick McHardy authored Apr 20, 2010

Replace the runtime oif name resolving by netdevice notifier based
resolving. When an oif is given, a netdevice notifier is registered
to resolve the name on NETDEV_REGISTER or NETDEV_CHANGE and unresolve
it again on NETDEV_UNREGISTER or NETDEV_CHANGE to a different name.
Signed-off-by: Patrick McHardy <kaber@trash.net>

22265a5c

net: emphasize rtnl lock required in call_netdevice_notifiers · ab930471

Jiri Pirko authored Apr 20, 2010

Since netdev_chain is guarded by rtnl_lock, ASSERT_RTNL should be
present here to make sure that all callers of call_netdevice_notifiers
does the locking properly.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ab930471

rps: consistent rxhash · b249dcb8

Eric Dumazet authored Apr 19, 2010

In case we compute a software skb->rxhash, we can generate a consistent
hash : Its value will be the same in both flow directions.

This helps some workloads, like conntracking, since the same state needs
to be accessed in both directions.

tbench + RFS + this patch gives better results than tbench with default
kernel configuration (no RPS, no RFS)

Also fixed some sparse warnings.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b249dcb8

rps: cleanups · e36fa2f7

Eric Dumazet authored Apr 19, 2010

struct softnet_data holds many queues, so consistent use "sd" name
instead of "queue" is better.

Adds a rps_ipi_queued() helper to cleanup enqueue_to_backlog()

Adds a _and_irq_disable suffix to net_rps_action() name, as David
suggested.

incr_input_queue_head() becomes input_queue_head_incr()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e36fa2f7

19 Apr, 2010 17 commits

rps: static functions · f5acb907

Eric Dumazet authored Apr 19, 2010

store_rps_map() & store_rps_dev_flow_table_cnt() are static.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f5acb907

rps: shortcut net_rps_action() · 88751275

Eric Dumazet authored Apr 19, 2010

net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
RPS is not active.

Tom Herbert used two bitmasks to hold information needed to send IPI,
but a single LIFO list seems more appropriate.

Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
(remove two ifdefs)

Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.

In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

88751275

bnx2x: Date and version · a03b1a5c

Vladislav Zolotarov authored Apr 19, 2010

Set version to 1.52.53-1.
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a03b1a5c

bnx2x: Don't report link down if has been already down · d9e8b185

Vladislav Zolotarov authored Apr 19, 2010

Author: Yaniv Rosner <yanivr@broadcom.com>
Signed-off-by: Yaniv Rosner <yanivr@broadcom.com>
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d9e8b185

bnx2x: Rework power state handling code · d3dbfee0

Vladislav Zolotarov authored Apr 19, 2010

Move "don't shut down the power" logic into bnx2x_set_power_state() to make the code cleaner.
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d3dbfee0

bnx2x: use mask in test_registers() to avoid parity error · 8eb5a20c

Vladislav Zolotarov authored Apr 19, 2010

Properly mask the value to be written to the register (according to the register size) during the self-test.
Otherwise immediate parity error would be generated.
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8eb5a20c

bnx2x: Fixed MSI-X enabling flow · 1ac218c8

Vladislav Zolotarov authored Apr 19, 2010

Try to enable less MSI-X vectors if initial request has failed.

Author: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1ac218c8

bnx2x: Added new statistics · dea7aab1

Vladislav Zolotarov authored Apr 19, 2010

Added total_mcast/bcast_pkts_transmitted statistics.

Author: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dea7aab1

bnx2x: White spaces · cdaa7cb8

Vladislav Zolotarov authored Apr 19, 2010

White spaces, code readability and prints.
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cdaa7cb8

bnx2x: Protect code with NOMCP · 2145a920

Vladislav Zolotarov authored Apr 19, 2010

Don't run code that can't be run if MCP is not present.
This will prevent NULL pointer dereferencing.
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2145a920

bnx2x: Increase DMAE max write size for 57711 · 02e3c6cb

Vladislav Zolotarov authored Apr 19, 2010

Increase DMAE max write size for 57711 to the maximum allowed value.
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

02e3c6cb

bnx2x: Use VPD-R V0 entry to display firmware revision · 34f24c7f

Vladislav Zolotarov authored Apr 19, 2010

Author: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

34f24c7f

bnx2x: Parity errors handling for 57710 and 57711 · 72fd0718

Vladislav Zolotarov authored Apr 19, 2010

This patch introduces the parity errors handling code for 57710 and 57711 chips.

HW is configured to stop all DMA transactions to the host and sending packets to the network
once parity error is detected, which is meant to prevent silent data corruption.
At the same time HW generates the attention interrupt to every function of the device where parity
has been detected so that driver can start the recovery flow.

The recovery is actually resetting the chip and restarting the driver on all active functions
of the chip where the parity error has been reported.
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

72fd0718

netfilter: xtables: remove old comments about reentrancy · 5b775eb1
Jan Engelhardt authored Apr 19, 2010
```
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
```
5b775eb1

netfilter: xt_TEE: have cloned packet travel through Xtables too · cd58bcd9

Jan Engelhardt authored Apr 19, 2010

Since Xtables is now reentrant/nestable, the cloned packet can also go
through Xtables and be subject to rules itself.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>

cd58bcd9

netfilter: xtables: make ip_tables reentrant · f3c5c1bf

Jan Engelhardt authored Apr 19, 2010

Currently, the table traverser stores return addresses in the ruleset
itself (struct ip6t_entry->comefrom). This has a well-known drawback:
the jumpstack is overwritten on reentry, making it necessary for
targets to return absolute verdicts. Also, the ruleset (which might
be heavy memory-wise) needs to be replicated for each CPU that can
possibly invoke ip6t_do_table.

This patch decouples the jumpstack from struct ip6t_entry and instead
puts it into xt_table_info. Not being restricted by 'comefrom'
anymore, we can set up a stack as needed. By default, there is room
allocated for two entries into the traverser.

arp_tables is not touched though, because there is just one/two
modules and further patches seek to collapse the table traverser
anyhow.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>

f3c5c1bf

netfilter: xtables: inclusion of xt_TEE · e281b198

Jan Engelhardt authored Apr 19, 2010

xt_TEE can be used to clone and reroute a packet. This can for
example be used to copy traffic at a router for logging purposes
to another dedicated machine.

References: http://www.gossamer-threads.com/lists/iptables/devel/68781Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>

e281b198

18 Apr, 2010 2 commits

net: Introduce skb_orphan_try() · fc6055a5

Eric Dumazet authored Apr 16, 2010

Transmitted skb might be attached to a socket and a destructor, for
memory accounting purposes.

Traditionally, this destructor is called at tx completion time, when skb
is freed.

When tx completion is performed by another cpu than the sender, this
forces some cache lines to change ownership. XPS was an attempt to give
tx completion to initial cpu.

David idea is to call destructor right before giving skb to device (call
to ndo_start_xmit()). Because device queues are usually small, orphaning
skb before tx completion is not a big deal. Some drivers already do
this, we could do it in upper level.

There is one known exception to this early orphaning, called tx
timestamping. It needs to keep a reference to socket until device can
give a hardware or software timestamp.

This patch adds a skb_orphan_try() helper, to centralize all exceptions
to early orphaning in one spot, and use it in dev_hard_start_xmit().

"tbench 16" results on a Nehalem machine (2 X5570  @ 2.93GHz)
before: Throughput 4428.9 MB/sec 16 procs
after: Throughput 4448.14 MB/sec 16 procs

UDP should get even better results, its destructor being more complex,
since SOCK_USE_WRITE_QUEUE is not set (four atomic ops instead of one)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc6055a5

net: remove time limit in process_backlog() · 9958da05

Eric Dumazet authored Apr 17, 2010

- There is no point to enforce a time limit in process_backlog(), since
other napi instances dont follow same rule. We can exit after only one
packet processed...
The normal quota of 64 packets per napi instance should be the norm, and
net_rx_action() already has its own time limit.
Note : /proc/net/core/dev_weight can be used to tune this 64 default
value.

- Use DEFINE_PER_CPU_ALIGNED for softnet_data definition.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9958da05

17 Apr, 2010 1 commit

rps: rps_sock_flow_table is mostly read · 8770acf0

Eric Dumazet authored Apr 17, 2010

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8770acf0

16 Apr, 2010 4 commits

rfs: Receive Flow Steering · fec5e652

Tom Herbert authored Apr 16, 2010

This patch implements receive flow steering (RFS).  RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running.  RFS is an
extension of Receive Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure.  The rxhash is passed in skb's received on
the connection from netif_receive_skb.  For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.

The convolution of the simple approach is that it would potentially
allow OOO packets.  If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_table is a global hash table.  Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue.  Each entry
contains a CPU and a tail queue counter.  The CPU is the "current"
CPU for a matching flow.  The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length.  When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.

And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted.  When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table.  This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.

Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU s good for locality.  2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.

There are two configuration parameters for RFS.  The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue.  Both are rounded to power of two.

The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application
load, and other factors.  On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation.  However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.

Below are some benchmark results which show the potential benfit of
this patch.  The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp.  The RPC test is an request/response
test similar in structure to netperf RR test ith 100 threads on
each host, but does more work in userspace that netperf.

e1000e on 8 core Intel
   No RFS or RPS		104K tps at 30% CPU
   No RFS (best RPS config):    290K tps at 63% CPU
   RFS				303K tps at 61% CPU

RPC test	tps	CPU%	50/90/99% usec latency	Latency StdDev
  No RFS/RPS	103K	48%	757/900/3185		4472.35
  RPS only:	174K	73%	415/993/2468		491.66
  RFS		223K	73%	379/651/1382		315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fec5e652

ipv6: fix the comment of ip6_xmit() · b5d43998

Shan Wei authored Apr 15, 2010

ip6_xmit() is used by upper transport protocol.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b5d43998

net: replace ipfragok with skb->local_df · 4e15ed4d

Shan Wei authored Apr 15, 2010

As Herbert Xu said: we should be able to simply replace ipfragok
with skb->local_df. commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has droped ipfragok and set local_df value properly.

The patch kills the ipfragok parameter of .queue_xmit().
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4e15ed4d

ipv6: cancel to setting local_df in ip6_xmit() · 0eecb784

Shan Wei authored Apr 15, 2010

commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has droped ipfragok and set local_df value properly.

So the change of commit 77e2f1(ipv6: Fix ip6_xmit to
send fragments if ipfragok is true) is not needed.
So the patch remove them.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0eecb784