Commits · 3fb62c5d3fc1821f50c6003e582713857a520f6b · nexedi / linux

22 Apr, 2013 19 commits

net: remove a stale comment for dl_next · 3fb62c5d

Eric Dumazet authored Apr 19, 2013

dl_next member in struct request_sock doesn't need to be first.

We expect to insert a "struct common_sock" or a subset of it,
so this claim had to be verified.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3fb62c5d

qeth: Fix missing pointer update · d4ae1f5e

Stefan Raspl authored Apr 22, 2013

qeth_hdr_chk_and_bounce() can possibly shift the skb->data
pointer. However, the existing code didn't update the hdr pointer,
which should point to skb->data, accordingly.
Symptoms of this issue are sporadic recoveries.
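
The bug class can be shown outside the driver. The following is a minimal userspace sketch, not qeth code: `struct buf`, `bounce_if_needed()` and `prepare_hdr()` are hypothetical stand-ins for the skb, qeth_hdr_chk_and_bounce() and the caller. Any pointer derived from `data` before the call has to be derived again after it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an skb: 'data' points somewhere inside 'storage'. */
struct buf {
	unsigned char storage[64];
	unsigned char *data;
	size_t len;
};

/*
 * Hypothetical helper: if the payload starts at an odd offset, move it
 * to the start of the storage area. Returns 1 if 'data' was shifted.
 */
static int bounce_if_needed(struct buf *b)
{
	assert(b != NULL && b->data != NULL);
	if ((size_t)(b->data - b->storage) % 2 == 0)
		return 0;
	memmove(b->storage, b->data, b->len);
	b->data = b->storage;
	return 1;
}

/* Returns the header pointer that is valid after the possible shift. */
static unsigned char *prepare_hdr(struct buf *b)
{
	unsigned char *hdr = b->data;   /* derived before the call */

	if (bounce_if_needed(b))
		hdr = b->data;          /* the fix: re-derive after a shift */
	return hdr;
}
```

Without the re-derivation, `hdr` would keep pointing at the old payload location.
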
Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d4ae1f5e

qeth: remove unused variable · 065cc782

Stefan Raspl authored Apr 22, 2013

remove unused variable
Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com>
Signed-off-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

065cc782

qeth: remove cast for kzalloc return value · 4a912f98

Zhang Yanfei authored Apr 22, 2013

remove cast for kzalloc return value.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a912f98

xen-netback: don't disconnect frontend when seeing oversize packet · 03393fd5

Wei Liu authored Apr 22, 2013

Some frontend drivers are sending packets > 64 KiB in length. This length
overflows the length field in the first slot making the following slots have
an invalid length.

Turn this error back into a non-fatal error by dropping the packet. To keep
the following slots from triggering fatal errors, consume all slots in the
packet.

This does not reopen the security hole in XSA-39: if the packet has an
invalid number of slots, it will still hit the fatal error case.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

03393fd5

xen-netback: coalesce slots in TX path and fix regressions · 2810e5b9

Wei Liu authored Apr 22, 2013

This patch tries to coalesce tx requests when constructing grant copy
structures. It enables netback to deal with situation when frontend's
MAX_SKB_FRAGS is larger than backend's MAX_SKB_FRAGS.

With the help of coalescing, this patch tries to address two regressions
while avoiding reopening the security hole in XSA-39.

Regression 1. The reduction of the number of supported ring entries (slots)
per packet (from 18 to 17). This regression has been around for some time but
remained unnoticed until the XSA-39 security fix. This is fixed by coalescing
slots.

Regression 2. The XSA-39 security fix turned "too many frags" errors from
just dropping the packet into a fatal error that disables the VIF. This is
fixed by coalescing slots (handling 18 slots when backend's MAX_SKB_FRAGS is
17), which rules out false positives (using 18 slots is legit), and by
dropping packets that use 19 to max_skb_slots slots.

To avoid reopening the security hole in XSA-39, a frontend sending a packet
that uses more than max_skb_slots slots is considered malicious.

The behavior of netback for packet is thus:

    1-18            slots: valid
   19-max_skb_slots slots: drop and respond with an error
   max_skb_slots+   slots: fatal error

max_skb_slots is configurable by admin, default value is 20.
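
The three ranges in the table above can be sketched as a small classifier. This is an illustrative userspace sketch, not the netback code; `classify_slots()`, `enum slot_verdict` and `LEGIT_SLOTS_MAX` are hypothetical names.

```c
#include <assert.h>

enum slot_verdict { SLOTS_VALID, SLOTS_DROP, SLOTS_FATAL };

/* 18 is the slot count the commit message treats as always legitimate. */
#define LEGIT_SLOTS_MAX 18u   /* hypothetical name */

/* Map a packet's slot count onto the three behaviours listed in the table. */
static enum slot_verdict classify_slots(unsigned int slots,
					unsigned int max_skb_slots)
{
	assert(slots >= 1);                        /* the table starts at 1 */
	assert(max_skb_slots >= LEGIT_SLOTS_MAX);  /* assumed sane setting */

	if (slots > max_skb_slots)
		return SLOTS_FATAL;   /* frontend considered malicious */
	if (slots > LEGIT_SLOTS_MAX)
		return SLOTS_DROP;    /* drop and respond with an error */
	return SLOTS_VALID;
}
```

With the default max_skb_slots of 20, slot counts 19 and 20 are dropped and 21 is fatal.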

Also change variable name from "frags" to "slots" in netbk_count_requests.

Please note that RX path still has dependency on MAX_SKB_FRAGS. This will be
fixed with separate patch.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2810e5b9

xen-netfront: reduce gso_max_size to account for max TCP header · 9ecd1a75

Wei Liu authored Apr 22, 2013

The maximum packet including header that can be handled by netfront / netback
wire format is 65535. Reduce gso_max_size accordingly.

Drop the skb and print a warning when skb->len > 65535. This can 1) save the
effort of sending a malformed packet to netback, 2) help spot misconfiguration
of netfront in the future.
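
The 65535 limit follows from a 16-bit length field. A minimal sketch of the check, assuming the wire length is carried in a 16-bit unsigned field; `exceeds_wire_limit()` and `wire_len()` are hypothetical helpers, not netfront functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WIRE_MAX_LEN 65535u   /* largest value a 16-bit length field holds */

/* True if a packet of 'len' bytes cannot be described on the wire. */
static bool exceeds_wire_limit(uint32_t len)
{
	return len > WIRE_MAX_LEN;
}

/*
 * Length as it would be written into the 16-bit field. Callers must
 * reject oversize packets first: without the check, a length of 65536
 * would silently wrap to 0.
 */
static uint16_t wire_len(uint32_t len)
{
	assert(!exceeds_wire_limit(len));
	return (uint16_t)len;
}
```
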
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9ecd1a75

xen-netfront: frags -> slots in log message · 697089dc

Wei Liu authored Apr 22, 2013

Also fix a typo in comment.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

697089dc

be2net: enable IOMMU pass through for be2net · 2bd92cd2

Craig Hada authored Apr 21, 2013

This patch sets the coherent DMA mask to 64-bit once the be2net driver has
confirmed that the system is 64-bit DMA capable. The coherent DMA
mask is examined by the Intel IOMMU driver to determine whether to allow
pass through context mapping for all devices. With this patch, the be2net
driver combined with be2net compatible hardware provides comparable
performance to the case where vt-d is disabled. The main use-case for this
change is to decrease the time necessary to copy virtual machine memory
during KVM live migration instantiations.

This patch was tested on a system that enables the IOMMU in non-coherent
mode. Two DMA remapper issues were encountered in the previous version and
both patches have been committed.
    commit ea2447f7
    commit 2e12bc29
Signed-off-by: Craig Hada <craig.hada@hp.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2bd92cd2

be2net: Use GET_PROFILE_CONFIG V1 cmd for BE3-R · a05f99db

Vasundhara Volam authored Apr 21, 2013

Use GET_PROFILE_CONFIG_V1 cmd for BE3-R, to query the maximum number of
TX rings available per function. On SH-R the same is queried via the
GET_FUNCTION_CONFIG cmd.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a05f99db

be2net: Avoid flashing BE3 UFI on BE3-R chip. · 0ad3157e

Vasundhara Volam authored Apr 21, 2013

Avoid flashing BE3 UFI on BE3-R chip by verifying asic_revision
number of the chip.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0ad3157e

be2net: Don't log "Out of MCCQ wrbs" error · 4d277125

Vasundhara Volam authored Apr 21, 2013

Don't log "Out of MCCQ wrbs" msg. When the driver doesn't receive any
response from the FW it already logs a "FW not responding" message.
The driver runs out of MCCQ wrbs much later. Also, this message can
swamp the kernel log in HW/FW error scenarios.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4d277125

be2net: Use TXQ_CREATE_V2 cmd · 94d73aaa

Vasundhara Volam authored Apr 21, 2013

Skyhawk-R and BE3-R (SuperNIC profile) require V2 version
of TXQ_CREATE cmd to be used.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

94d73aaa

bnx2x: update version to 1.78.17-0 · 26f26b3a

Dmitry Kravkov authored Apr 22, 2013

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

26f26b3a

bnx2x: allow nvram test to run when device is down · d2d2d87d

Dmitry Kravkov authored Apr 22, 2013

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d2d2d87d

bnx2x: add additional regions for CRC memory test · edb944d2

Dmitry Kravkov authored Apr 22, 2013

a. Common tree of `dir` structures.
b. Multi-port devices structures.

CC: Francious Romieu <romieu@fz.zoreil.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

edb944d2

bnx2x: remove non-necessary assignment · f1691dc6

Dmitry Kravkov authored Apr 22, 2013

CC: Francious Romieu <romieu@fz.zoreil.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f1691dc6

bnx2x: fix byte-by-byte nvram write for BE machines · 30c20b67

Dmitry Kravkov authored Apr 22, 2013

CC: Francious Romieu <romieu@fz.zoreil.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

30c20b67

bnx2x: refactor nvram read procedure · 85640952

Dmitry Kravkov authored Apr 22, 2013

introduce a procedure to read in u32 granularity.

CC: Francious Romieu <romieu@fz.zoreil.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

85640952

21 Apr, 2013 4 commits

qeth: fix VLAN related compilation errors · 91b1c1aa

Patrick McHardy authored Apr 21, 2013

drivers/s390/net/qeth_l3_main.c: In function 'qeth_l3_add_vlan_mc':
>> drivers/s390/net/qeth_l3_main.c:1662:3: error: too few arguments to function '__vlan_find_dev_deep'
   include/linux/if_vlan.h:88:27: note: declared here
   drivers/s390/net/qeth_l3_main.c: In function 'qeth_l3_add_vlan_mc6':
>> drivers/s390/net/qeth_l3_main.c:1723:3: error: too few arguments to function '__vlan_find_dev_deep'
   include/linux/if_vlan.h:88:27: note: declared here
   drivers/s390/net/qeth_l3_main.c: In function 'qeth_l3_free_vlan_addresses4':
>> drivers/s390/net/qeth_l3_main.c:1767:2: error: too few arguments to function '__vlan_find_dev_deep'
   include/linux/if_vlan.h:88:27: note: declared here
   drivers/s390/net/qeth_l3_main.c: In function 'qeth_l3_free_vlan_addresses6':
>> drivers/s390/net/qeth_l3_main.c:1797:2: error: too few arguments to function '__vlan_find_dev_deep'
   include/linux/if_vlan.h:88:27: note: declared here
   drivers/s390/net/qeth_l3_main.c: In function 'qeth_l3_process_inbound_buffer':
>> drivers/s390/net/qeth_l3_main.c:1980:6: error: too few arguments to function '__vlan_hwaccel_put_tag'
   include/linux/if_vlan.h:234:31: note: declared here
   drivers/s390/net/qeth_l3_main.c: In function 'qeth_l3_verify_vlan_dev':
>> drivers/s390/net/qeth_l3_main.c:2089:3: error: too few arguments to function '__vlan_find_dev_deep'
   include/linux/if_vlan.h:88:27: note: declared here
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

91b1c1aa

net: vlan: fix up vlan_proto_idx() for CONFIG_BUG=n · 8da63a65

Patrick McHardy authored Apr 21, 2013

Add missing return statement for CONFIG_BUG=n.
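
Why the return is needed can be seen in a userspace sketch. It assumes that BUG() expands to nothing when CONFIG_BUG is disabled; the `SKETCH_` names and the host-order protocol values are illustrative and not the kernel definitions.

```c
#include <assert.h>

#define SKETCH_ETH_P_8021Q  0x8100u   /* 802.1Q TPID, host byte order */
#define SKETCH_ETH_P_8021AD 0x88A8u   /* 802.1ad TPID, host byte order */

enum { SKETCH_PROTO_8021Q, SKETCH_PROTO_8021AD };

#ifdef SKETCH_CONFIG_BUG
#define BUG() assert(0)            /* stand-in for a trapping BUG() */
#else
#define BUG() do { } while (0)     /* assumed CONFIG_BUG=n expansion */
#endif

/* Map a VLAN protocol to an array index. */
static unsigned int sketch_vlan_proto_idx(unsigned int proto)
{
	switch (proto) {
	case SKETCH_ETH_P_8021Q:
		return SKETCH_PROTO_8021Q;
	case SKETCH_ETH_P_8021AD:
		return SKETCH_PROTO_8021AD;
	default:
		BUG();
		return 0;   /* required once BUG() no longer terminates */
	}
}
```

Without the final return, control would fall off the end of a non-void function whenever BUG() is a no-op.
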
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

8da63a65

net: vlan: fix dummy function signatures for CONFIG_VLAN=n · 9fae27b3

Patrick McHardy authored Apr 20, 2013

Fix up some function signatures for CONFIG_VLAN=n that were missed during
the 802.1ad support patches.

Found by the kbuild robot.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

9fae27b3

net: vlan: fix memory leak in vlan_info_rcu_free() · cf2c014a

Patrick McHardy authored Apr 20, 2013

The following leak is reported by kmemleak:

[   86.812073] kmemleak: Found object by alias at 0xffff88006ecc76f0
[   86.816019] Pid: 739, comm: kworker/u:1 Not tainted 3.9.0-rc5+ #842
[   86.816019] Call Trace:
[   86.816019]  <IRQ>  [<ffffffff81151c58>] find_and_get_object+0x8c/0xdf
[   86.816019]  [<ffffffff8190e90d>] ? vlan_info_rcu_free+0x33/0x49
[   86.816019]  [<ffffffff81151cbe>] delete_object_full+0x13/0x2f
[   86.816019]  [<ffffffff8194bbb6>] kmemleak_free+0x26/0x45
[   86.816019]  [<ffffffff8113e8c7>] slab_free_hook+0x1e/0x7b
[   86.816019]  [<ffffffff81141c05>] kfree+0xce/0x14b
[   86.816019]  [<ffffffff8190e90d>] vlan_info_rcu_free+0x33/0x49
[   86.816019]  [<ffffffff810d0b0b>] rcu_do_batch+0x261/0x4e7

The reason is that in vlan_info_rcu_free() we don't take the VLAN protocol
into account when iterating over the vlan_devices_array.
Reported-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Tested-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cf2c014a

19 Apr, 2013 17 commits

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 95a06161

David S. Miller authored Apr 19, 2013

Pablo Neira Ayuso says:

====================
The following patchset contains a small batch of Netfilter
updates for your net-next tree, they are:

* Three patches that provide more accurate error reporting to
  user-space, instead of -EPERM, in IPv4/IPv6 netfilter re-routing
  code and NAT, from Patrick McHardy.

* Update copyright statements in Netfilter filters of
  Patrick McHardy, from himself.

* Add Kconfig dependency on the raw/mangle tables to the
  rpfilter, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

95a06161

bond: add support to read speed and duplex via ethtool · bb5b052f

Andy Gospodarek authored Apr 16, 2013

This patch adds support for the get_settings ethtool op to the bonding
driver.  This was motivated by users who wanted to get the speed of the
bond and compare that against throughput to understand utilization.
The behavior before this patch was added was problematic when computing
line utilization after trying to get link-speed and throughput via SNMP.

Output from ethtool looks like this for a round-robin bond:

Settings for bond0:
	Supported ports: [ ]
	Supported link modes:   Not reported
	Supported pause frame use: No
	Supports auto-negotiation: No
	Advertised link modes:  Not reported
	Advertised pause frame use: No
	Advertised auto-negotiation: No
	Speed: 11000Mb/s
	Duplex: Full
	Port: Other
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: off
	MDI-X: Unknown
	Link detected: yes

I tested this and verified it works as expected.  A test was also done
on a version backported to an older kernel and it worked well there.

v2: Switch to using ethtool_cmd_speed_set to set speed, added check to
SLAVE_IS_OK for each slave in bond, dropped mode-specific calculations
as they were not needed, and set port type to 'Other.'
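
A sketch of how such a speed could be computed, assuming the reported value is the sum of the speeds of all usable slaves (an inference from the 11000Mb/s sample output, for example one 10000Mb/s and one 1000Mb/s slave). `struct slave_sketch` and `bond_speed_mbps()` are hypothetical and not the bonding driver's types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-slave state; 'usable' stands in for a SLAVE_IS_OK check. */
struct slave_sketch {
	unsigned int speed_mbps;
	bool usable;
};

/* Sum the speed of every usable slave, with no mode-specific handling. */
static unsigned long bond_speed_mbps(const struct slave_sketch *slaves,
				     size_t n)
{
	unsigned long total = 0;

	assert(slaves != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (slaves[i].usable)
			total += slaves[i].speed_mbps;
	}
	return total;
}
```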

v3: Fix useless assignment and checkpatch warning.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bb5b052f

packet: move hw/sw timestamp extraction into a small helper · 4b457bdf

Daniel Borkmann authored Apr 16, 2013

This patch introduces a small, internal helper function, that is used by
PF_PACKET. Based on the flags that are passed, it extracts the packet
timestamp in the receive path. This is merely a refactoring to remove
some duplicate code in tpacket_rcv(), to make it more readable, and to
enable others to use this function in PF_PACKET as well, e.g. for TX.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4b457bdf

net: socket: move ktime2ts to ktime header api · 6e94d1ef

Daniel Borkmann authored Apr 16, 2013

Currently, ktime2ts is a small helper function that is only used in
net/socket.c. Move this helper into the ktime API as a small inline
function, so that i) it's maintained together with ktime routines,
and ii) also other files can make use of it. The function is named
ktime_to_timespec_cond() and placed into the generic part of ktime,
since we internally make use of ktime_to_timespec(). ktime_to_timespec()
itself does not check the ktime variable for zero, hence, we name
this function ktime_to_timespec_cond() for only a conditional
conversion, and adapt its users to it.
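
The idea of a conditional conversion can be sketched in userspace, with a plain nanosecond count standing in for ktime_t. `ns_to_timespec()` and `ns_to_timespec_cond()` are hypothetical names modelled on the description above, not the kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Unconditional conversion; does not look at whether the value is set. */
static struct timespec ns_to_timespec(int64_t ns)
{
	struct timespec ts;

	assert(ns >= 0);
	ts.tv_sec = (time_t)(ns / NSEC_PER_SEC);
	ts.tv_nsec = (long)(ns % NSEC_PER_SEC);
	return ts;
}

/*
 * Conditional variant: converts only when the timestamp is non-zero.
 * Returns false and leaves *ts untouched otherwise.
 */
static bool ns_to_timespec_cond(int64_t ns, struct timespec *ts)
{
	assert(ts != NULL);
	if (ns == 0)
		return false;
	*ts = ns_to_timespec(ns);
	return true;
}
```
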
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6e94d1ef

net: Add .gitignore to networking selftests directory. · cf270148

David S. Miller authored Apr 19, 2013

Signed-off-by: David S. Miller <davem@davemloft.net>

cf270148

net: Add missing netdev feature strings for NETIF_F_HW_VLAN_STAG_* · 2d6577f1

David S. Miller authored Apr 19, 2013

Noticed by Ben Hutchings.
Signed-off-by: David S. Miller <davem@davemloft.net>

2d6577f1

Merge branch 'qlcnic' · 92352df1

David S. Miller authored Apr 19, 2013

Rajesh Borundia says:

====================
* "qlcnic: Change 82xx adapter VLAN id endian type".
  - Adapter requires VLAN id in little endian. VLAN id was being
    converted to __le16 and then passed as a parameter. Pass VLAN id
    as u16 and then use cpu_to_le16 at appropriate places. It is
    appropriate for net-next as SR-IOV patches have a dependency on it.
* "qlcnic: Fix loopback test for SR-IOV PF".
  - It is appropriate for net-next as change is needed for SRIOV PF
    only.
* Remaining patches add enhancements to SR-IOV functionality like
  - FLR handling
  - Adapter reset recovery handling
  - iproute2 tool support for configuring MAC address, Tx rate and
    VLAN id.
  - Mailbox polling support for SR-IOV PF in case mailbox interrupts
    are disabled.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

92352df1

qlcnic: Update version to 5.2.41 · c6376278

Rajesh Borundia authored Apr 19, 2013

Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c6376278

qlcnic: Support polling for mailbox events. · 7ed3ce48

Rajesh Borundia authored Apr 19, 2013

o When mailbox interrupt is disabled PF should be
  able to process request from VF. Enable polling
  for such cases.
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7ed3ce48

qlcnic: Fix loopback test for SR-IOV PF. · d1a1105e

Rajesh Borundia authored Apr 19, 2013

o Do not disable mailbox interrupts while running
  loopback test through SR-IOV PF.
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d1a1105e

qlcnic: Support VLAN id config. · 91b7282b

Rajesh Borundia authored Apr 19, 2013

o Add support for VLAN id configuration per VF using
  iproute2 tool.
o VLAN id's 1-4094 are treated as PVID by the PF and
  Guest VLAN tagging is not allowed by default.
o PVID is disabled when the VLAN id is set to 0
o Guest VLAN tagging is allowed when the VLAN id is set to 4095.
o Only one Guest VLAN id  is supported.
o VLAN id can be changed only when the VF driver is not loaded.
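
The VLAN id rules above can be sketched as a mapping from the configured id to a per-VF mode. `enum vf_vlan_mode` and `vf_vlan_mode_for()` are hypothetical names for illustration, not qlcnic code.

```c
#include <assert.h>

#define VLAN_ID_MAX 4095u   /* a VLAN id is a 12-bit value */

enum vf_vlan_mode {
	VF_VLAN_PVID_DISABLED,   /* id 0 */
	VF_VLAN_PVID,            /* ids 1-4094, guest tagging not allowed */
	VF_VLAN_GUEST_TAGGING    /* id 4095 */
};

/* Interpret the VLAN id configured for a VF. */
static enum vf_vlan_mode vf_vlan_mode_for(unsigned int vlan_id)
{
	assert(vlan_id <= VLAN_ID_MAX);

	if (vlan_id == 0)
		return VF_VLAN_PVID_DISABLED;
	if (vlan_id == VLAN_ID_MAX)
		return VF_VLAN_GUEST_TAGGING;
	return VF_VLAN_PVID;
}
```
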
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

91b7282b

qlcnic: Support MAC address, Tx rate config. · 4000e7a7

Rajesh Borundia authored Apr 19, 2013

o Add support for MAC address and Tx rate configuration
  per VF via iproute2 tool.
o Tx rate change is allowed while the guest is running
  and the VF driver is loaded.
o MAC address change is allowed only when VF driver
  is not loaded.
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4000e7a7

qlcnic: VF reset recovery implementation. · f036e4f4

Rajesh Borundia authored Apr 19, 2013

o Implement recovery mechanism for VF to recover from
  adapter resets.
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f036e4f4

qlcnic: VF FLR implementation. · 97d8105c

Rajesh Borundia authored Apr 19, 2013

o FLR from Hypervisor - When hypervisor issues a VF FLR request,
  adapter notifies the parent PF driver of the FLR request for PF
  driver to perform any cleanup on behalf of that VF.
o FLR from VF Driver - VF driver may initiate a VF FLR request,
  if VF state needs to be cleaned up before a re-initialization.
  VF re-initialization during kdump is an example.
o PF driver cleans up all resources allocated on behalf of a  VF,
  on VF FLR notifications from the adapter or from the VF driver.
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

97d8105c

qlcnic: Change 82xx adapter VLAN id endian type. · f80bc8fe

Rajesh Borundia authored Apr 19, 2013

o 82xx adapter requires VLAN id in little endian format.
  Instead of passing vlan id parameter as __le16, pass the
  parameter as u16 and  use cpu_to_le16 at appropriate places.
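
The pattern can be sketched in userspace: the VLAN id travels through function parameters in CPU byte order and is converted only at the point where it is written into the request. `sketch_cpu_to_le16()` is a stand-in for the kernel's cpu_to_le16(), and `struct fw_req` is a hypothetical request layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for cpu_to_le16(): result has little endian byte layout in memory. */
static uint16_t sketch_cpu_to_le16(uint16_t v)
{
	unsigned char bytes[2];
	uint16_t out;

	bytes[0] = (unsigned char)(v & 0xff);   /* low byte first */
	bytes[1] = (unsigned char)(v >> 8);
	memcpy(&out, bytes, sizeof(out));
	return out;
}

/* Hypothetical request as handed to the adapter. */
struct fw_req {
	uint16_t vlan_id_le;
};

/* Takes a CPU-order id and converts it where the request is filled in. */
static void fill_req(struct fw_req *req, uint16_t vlan_id)
{
	assert(req != NULL);
	req->vlan_id_le = sketch_cpu_to_le16(vlan_id);
}
```
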
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f80bc8fe

Merge branch 'netlink-mmap' · 42bbcb78

David S. Miller authored Apr 19, 2013

Patrick McHardy says:

====================
The following patches contain an implementation of memory mapped I/O for
netlink. The implementation is modelled after AF_PACKET memory mapped I/O
with a few differences:

- In order to perform memory mapped I/O to userspace, the kernel allocates
skbs with the data area pointing to the data area of the mapped frames.
All netlink subsystems assume a linear data area, so for the sake of
simplicity, the mapped data area is not attached to the paged area but
to skb->data. This requires introduction of a special skb allocation
function that just allocates an skb head without the data area. Since this
is a quite rare use case, I introduced a new function based on __alloc_skb
instead of splitting it up into head and data allocation. The alternative
would be to introduce an __alloc_skb_head and __alloc_skb_data function,
which would actually be useful for a specific error case in memory mapped
netlink, but would require a couple of extra instructions for the common
skb allocation case, so it doesn't really seem worth it.

In order to get the destination memory area for skb->data before message
construction, memory mapped netlink I/O needs to look up the destination
socket during allocation instead of during transmission because the
ring is owned by the receiving socket/process. A special skb allocation
function (netlink_alloc_skb) taking the destination pid as an argument is
used for this, all subsystems that want to support memory mapped I/O need
to use this function, automatic fallback to the receive queue happens
for unconverted subsystems. Dumps automatically use memory mapped I/O if
the receiving socket has enabled it.

The visible effect of looking up the destination socket during allocation
instead of transmission is that message ordering in userspace might
change in case allocation and transmission aren't performed atomically.
This usually doesn't matter since most subsystems have a BKL-like lock
like the rtnl mutex, to my knowledge the currently only existing case
where it might matter is nfnetlink_queue combined with the recently
introduced batched verdicts, but a) that subsystem already includes
sequence numbers which allow userspace to reorder messages in case it
cares to, also the reordering window is quite small and b) with memory
mapped transmission batching can be performed in a subsystem independent
manner.

- AF_NETLINK contains flow control for database dumps, with regular I/O
dump continuations are triggered based on the socket's receive queue space
and by recvmsg() calls. Since with memory mapped I/O there are no
recvmsg() calls under normal operation, this is done in netlink_poll(),
under the assumption that userspace has processed all pending frames
before invoking poll(), thus the ring is expected to have room for new
messages. Dumps currently don't benefit as much as they could from
memory mapped I/O because each single continuation requires a poll()
call. A more aggressive approach seems like a good idea to me, especially
in case the socket is not subscribed to any multicast groups (IOW only
receiving explicitly requested data).

Besides that, the memory mapped netlink implementation extends the states
defined by AF_PACKET between userspace and the kernel by a SKIP status, this
is intended for the case that userspace wants to queue frames (specifically
when using nfnetlink_queue, an IDS and stream reassembly, requested by
Eric Leblond) for a longer period of time. The kernel skips over all frames
marked with SKIP when looking for unused frames and only fails when not finding
a free frame or when having skipped the entire ring.
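
One way to read the SKIP rule is sketched below: starting at the ring head, frames marked SKIP are stepped over, the first other frame must be unused, and the search fails if every frame was skipped. The status names and `find_unused_frame()` are hypothetical; this is an interpretation of the description, not the netlink implementation.

```c
#include <assert.h>
#include <stddef.h>

enum frame_status { FRAME_UNUSED, FRAME_VALID, FRAME_SKIP };

/*
 * Return the index of the frame to fill next, or -1 if the first
 * non-skipped frame is still in use or the whole ring is marked SKIP.
 */
static int find_unused_frame(const enum frame_status *ring, size_t n,
			     size_t head)
{
	assert(ring != NULL && n > 0 && head < n);

	for (size_t i = 0; i < n; i++) {
		size_t idx = (head + i) % n;

		if (ring[idx] == FRAME_SKIP)
			continue;   /* held by userspace, step over it */
		return ring[idx] == FRAME_UNUSED ? (int)idx : -1;
	}
	return -1;   /* skipped the entire ring */
}
```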

Also noteworthy is memory mapped sendmsg: the kernel performs validation
of messages before accepting and processing them, in order to prevent
userspace from changing the messages contents after validation, the
kernel checks that the ring is only mapped once and the file descriptor
is not shared (in order to avoid having userspace set up another mapping
after the first mentioned check). If either condition is not true, the
message is copied to an allocated skb and processed as with regular I/O.
I'd especially appreciate review of this part since I'm not really versed
in memory, file and process management.

The remaining interesting details are included in the changelogs of the
individual patches and the documentation, so I won't repeat them here.

As an example, nfnetlink_queue is converted to support memory mapped
I/O. Other subsystems that would probably benefit are nfnetlink_log,
audit and maybe iSCSI, not sure.

Following are some numbers collected by Florian Westphal based on a
slightly older version, which included an experimental patch for the
nfnetlink_queue ordering issue.

===

Test hardware is a 12-core machine
Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
ixgbe interfaces are used (i.e., multiqueue nics).
irqs are distributed across the cpus.

I've made several tests.

The simple one consists of 3GBit UDP traffic, packets are 1500 bytes
in size (i.e., no fragmentation), with a single nfqueue
and the test client programs in libmnl examples directory.
Packets are sent from one /24 net to another /24 net, i.e.
there are a few hundred flows active at any given time.

I've also tested with snort, but I disabled all rules.
6Gbit UDP traffic is generated in the snort case, and
6 nfqueues are used (i.e., 6 snorts run in parallel).

I've tested with 3 different kernels, all based on 3.7.1.
- 3.7.1, without the mmap patches
- 3.7.1, with Patrick's mmap patches
- 3.7.1, with mmap patches and extended spinlock to ensure packet ids are
monotonically increasing and cannot be re-ordered. This is what we
currently ship in our product.

[ the spinlock that is extended is the per nfqueue spinlock, it will
be held from the time the netlink skb is allocated until the netlink
skb is sent to userspace:

http://1984.lsi.us.es/git/nf-next/commit/?h=mmap-netlink3&id=b8eb19c46650fef4e9e4fe53f367f99bbf72afc9
]

snort is normally used in "batch mode", i.e., after processing 25 packets
a single "batch verdict" is sent to accept the packets seen so far.
"mmap snort" means RX_RING + sendmsg(), i.e. TX_RING is not used at this
time (except where noted below).

One reason is that snort has a reload thread, so kernel needs to copy;
also in the snort case no payload rewrite takes place, so compared
to the rx path the tx path is cheap.

Results:

3.7.1, without mmap patches, i.e. recv()+sendmsg() for everyone
nfq-queue: 1.7 gbit out
snort-recv-batch-25 5.1 gbit out
snort-recv-no-batch 3.1 gbit out

3.7.1 + mmap + without extended spinlocked section
nfq-queue: 1.7 gbit out (recv/sendmsg)
nfq-queue-mmap: 2.4 gbit out
snort-mmap-batch-25 5.6 gbit out (warning: since ids can be
re-ordered, this version is "broken").
snort-recv-batch-25 5.1 gbit out
snort-mmap-no-batch 4.6 gbit out (i.e., one verdict per packet)

Kernel 3.7.1 + mmap + extended spinlock section:
nfq-queue: 1.4 gbit out
nfq-queue-mmap: 2.3 gbit out
snort: 5.6 gbit out

Conclusions:
- The "extended spinlocked section" hurts performance in the
single queue case; with 6 snorts there is no measurable slowdown.
- I tried to re-write the mmap-snort to work without batch verdicts, but
results were not very encouraging:

kernel 3.7.1 + mmap (without extended spinlocked section):

snort-mmap-batch-25 5.6 gbit out (what we currently ship)
snort-recv-batch-25 5.1 gbit out (without using mmap)
snort-mmap-batch-1 4.6 gbit out (with mmap but without batch verdicts)
snort-mmap-txring-25 5.2 gbit out (with mmap but without batch verdicts)
snort-mmap-txring-1 4.6 gbit out (with mmap but without batch verdicts)

The difference between the last two is that in the txring-25 case, we
put a verdict into the tx ring after every packet, but will only
invoke sendmsg(, NULL, 0) after processing 25 packets. So the only
difference is the number of sendmsg calls/context switches.

So, i.o.w, kernel 3.7.1 + mmap + the extra locking crap is faster
than 3.7.1 + mmap-without-extra-locking and single-verdict-per packet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

42bbcb78

nfnetlink: add support for memory mapped netlink · 3ab1f683

Patrick McHardy authored Apr 17, 2013

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

3ab1f683