Commits · 2f7878c06e2d227aa5c405ddde356403b83e3509 · Kirill Smelkov / linux

25 Apr, 2017 40 commits

qed: fix invalid use of sizeof in qed_alloc_qm_data() · 2f7878c0

Wei Yongjun authored Apr 25, 2017

sizeof() when applied to a pointer typed expression gives the
size of the pointer, not that of the pointed data.

Fixes: b5a9ee7c ("qed: Revise QM configuration")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2f7878c0

bpf: map_get_next_key to return first key on NULL · 8fe45924

Teng Qin authored Apr 24, 2017

When iterating through a map, we need to find a key that does not exist
in the map so map_get_next_key will give us the first key of the map.
This often requires a lot of guessing in production systems.

This patch makes map_get_next_key return the first key when the key
pointer in the parameter is NULL.
Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

8fe45924

selftests/net: Fix broken test case in psock_fanout · 472ecf08

Mike Maloney authored Apr 24, 2017

The error return falue form sock_fanout_open is -1, not zero.  One test
case was checking for 0 instead of -1.

Tested: Built and tested in clean client.
Signed-off-by: Mike Maloney <maloney@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

472ecf08

net: ethernet: ti: netcp_core: remove unused compl queue mapping · 799dbe3e

Ivan Khoronzhuk authored Apr 24, 2017

This code is unused and probably was unintentionally left while
moving completion queue mapping in submit function.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

799dbe3e

Merge branch 'qed-vf-tunnel' · c1a9f80e

David S. Miller authored Apr 25, 2017

Manish Chopra says:

====================
qed/qede: VF tunnelling support

With this series VFs can run vxlan/geneve/gre tunnels over it.
Please consider applying this series to "net-next"
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c1a9f80e

qed - VF tunnelling support [VXLAN/GENEVE/GRE] · eaf3c0c6

Chopra, Manish authored Apr 24, 2017

This patch adds hardware channel APIs support between
VF and PF for tunnelling configuration for the VFs.
According to that configuration VFs can run VXLAN/GENEVE/GRE
tunnels over it with tunnel features offloaded.

Using these APIs VF can also request for UDP ports configuration
to the PF, although PF and it's child VFs share the same port.
Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eaf3c0c6

qed/qede: Add UDP ports in bulletin board · 97379f15

Chopra, Manish authored Apr 24, 2017

This patch adds support for UDP ports in bulletin board
to notify UDP ports change to the VFs
Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

97379f15

qede: Configure UDP ports in local context. · 327a2b75

Chopra, Manish authored Apr 24, 2017

This patch configures UDP ports locally instead of
configuring them in deferred context which would be
helpful in synchronizing UDP ports configuration for VFs
which will be enabled in further patches.
Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

327a2b75

qede: Disable tunnel offloads for non offloaded UDP ports · 369bfd4e

Chopra, Manish authored Apr 24, 2017

This patch disables tunnel offloads via ndo_features_check()
if given UDP port is not offloaded to hardware. This in turn
allows to run multiple tunnel interfaces using different UDP ports.
Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

369bfd4e

qed/qede: Enable tunnel offloads based on hw configuration · 19489c7f

Chopra, Manish authored Apr 24, 2017

This patch enables tunnel feature offloads based on hw configuration
at initialization time instead of enabling them always.
Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19489c7f

qed: refactor tunnelling - API/Structs · 19968430

Chopra, Manish authored Apr 24, 2017

This patch changes the tunnel APIs to use per tunnel
info instead of using bitmasks for all tunnels and also
uses single struct to hold the data to prepare multiple
variant of tunnel configuration ramrods to be sent to the hardware.
Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19968430

Merge branch 'l2tpeth-info' · 36784277

David S. Miller authored Apr 25, 2017

Guillaume Nault says:

====================
l2tp: add informations about l2tpeth interfaces in /sys

Patch #1 lets userspace retrieve the naming scheme of an l2tpeth
interface, using /sys/class/net/<iface>/name_assign_type.

Patch #2 adds the DEVTYPE field in /sys/class/net/<iface>/uevent so
that userspace can reliably know if a device is an l2tpeth interface.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

36784277

l2tp: define "l2tpeth" device type · a485c2b8

Guillaume Nault authored Apr 24, 2017

Export type of l2tpeth interfaces to userspace
(/sys/class/net/<iface>/uevent).
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a485c2b8

l2tp: set name_assign_type for devices created by l2tp_eth.c · c39855fe

Guillaume Nault authored Apr 24, 2017

Export naming scheme used when creating l2tpeth interfaces
(/sys/class/net/<iface>/name_assign_type). This let userspace know if
the device's name has been generated automatically or defined manually.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c39855fe

net sched actions: Complete the JUMPX opcode · e0ee84de

Jamal Hadi Salim authored Apr 23, 2017

per discussion at netconf/netdev:
When we have an action that is capable of branching (example a policer),
we can achieve a continuation of the action graph by programming a
"continue" where we find an exact replica of the same filter rule with a lower
priority and the remainder of the action graph. When you have 100s of thousands
of filters which require such a feature it gets very inefficient to do two
lookups.

This patch completes a leftover feature of action codes. Its time has come.

Example below where a user labels packets with a different skbmark on ingress
of a port depending on whether they have/not exceeded the configured rate.
This mark is then used to make further decisions on some egress port.

 #rate control, very low so we can easily see the effect
sudo $TC actions add action police rate 1kbit burst 90k \
conform-exceed pipe/jump 2 index 10
 # skbedit index 11 will be used if the user conforms
sudo $TC actions add action skbedit mark 11 ok index 11
 # skbedit index 12 will be used if the user does not conform
sudo $TC actions add action skbedit mark 12 ok index 12

 #lets bind the user ..
sudo $TC filter add dev $ETH parent ffff: protocol ip prio 8 u32 \
match ip dst 127.0.0.8/32 flowid 1:10 \
action police index 10 \
action skbedit index 11 \
action skbedit index 12

 #run a ping -f and see what happens..
 #
jhs@foobar:~$ sudo $TC -s filter ls dev $ETH parent ffff: protocol ip
filter pref 8 u32
filter pref 8 u32 fh 800: ht divisor 1
filter pref 8 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 2800 success 1005)
  match 7f000008/ffffffff at 16 (success 1005 )
	action order 1:  police 0xa rate 1Kbit burst 23440b mtu 2Kb action pipe/jump 2 overhead 0b
	ref 2 bind 1 installed 207 sec used 122 sec
	Action statistics:
	Sent 84420 bytes 1005 pkt (dropped 0, overlimits 721 requeues 0)
	backlog 0b 0p requeues 0

	action order 2:  skbedit mark 11 pass
	 index 11 ref 2 bind 1 installed 204 sec used 122 sec
 	Action statistics:
	Sent 60564 bytes 721 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

	action order 3:  skbedit mark 12 pass
	 index 12 ref 2 bind 1 installed 201 sec used 122 sec
 	Action statistics:
	Sent 23856 bytes 284 pkt (dropped 0, overlimits 0 requeues 0)
	backlog 0b 0p requeues 0

Not bad, about 28% non-conforming packets..
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e0ee84de

Merge tag 'linux-can-next-for-4.12-20170425' of... · 45a6f3bc

David S. Miller authored Apr 25, 2017

Merge tag 'linux-can-next-for-4.12-20170425' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2017-04-25

this is a pull request of 21 patches for net-next/master.

There are 4 patches by Stephane Grosjean for the PEAK PCAN-PCIe FD
CAN-FD boards. The next 7 patches are by Mario Huettel, which add
support for M_CAN IP version >= v3.1.x to the m_can driver. A patch by
Remigiusz Kołłątaj adds support for the Microchip CAN BUS Analyzer. 8
patches by Oliver Hartkopp complete the initial CAN network namespace
support. Wei Yongjun's patch for the ti_hecc driver fixes the return
value check in the probe function.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

45a6f3bc

ipvlan: use pernet operations and restrict l3s hooks to master netns · 3133822f

Florian Westphal authored Apr 20, 2017

commit 4fbae7d8 ("ipvlan: Introduce l3s mode") added
registration of netfilter hooks via nf_register_hooks().

This API provides the illusion of 'global' netfilter hooks by placing the
hooks in all current and future network namespaces.

In case of ipvlan the hook appears to be only needed in the namespace
that contains the ipvlan master device (i.e., usually init_net), so
placing them in all namespaces is not needed.

This switches ipvlan driver to pernet operations, and then only registers
hooks in namespaces where a ipvlan master device is set to l3s mode.

Extra care has to be taken when the master device is moved to another
namespace, as we might have to 'move' the netfilter hooks too.

This is done by storing the namespace the ipvlan port was created in.
On REGISTER event, do (un)register operations in the old/new namespaces.

This will also allow removal of the nf_register_hooks() in a future patch.

Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

3133822f

can: ti_hecc: fix return value check in ti_hecc_probe() · b655f0e9

Wei Yongjun authored Apr 25, 2017

In case of error, the function devm_ioremap_resource() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check should
be replaced with IS_ERR().

Fixes: dabf54dd ("can: ti_hecc: Convert TI HECC driver to DT only driver")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

b655f0e9

can: enable module auto loading for virtual CAN interfaces · 5e64ebc1

Oliver Hartkopp authored Apr 25, 2017

Autoload the vcan module when a vcan instance is to be created by
'ip link add type vcan'
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

5e64ebc1

can: add Virtual CAN Tunnel driver (vxcan) · a8f820a3

Oliver Hartkopp authored Apr 25, 2017

Similar to the virtual ethernet driver veth, vxcan implements a
local CAN traffic tunnel between two virtual CAN network devices.
See Kconfig entry for details.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

a8f820a3

can: network namespace support for CAN gateway · 1ef83310

Oliver Hartkopp authored Apr 25, 2017

The CAN gateway was not implemented as per-net in the initial network
namespace support by Mario Kicherer (8e8cda6d).
This patch enables the CAN gateway to be used in different namespaces.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

1ef83310

can: network namespace support for CAN_BCM protocol · 384317ef

Oliver Hartkopp authored Apr 25, 2017

The CAN_BCM protocol and its procfs entries were not implemented as per-net
in the initial network namespace support by Mario Kicherer (8e8cda6d).
This patch adds the missing per-net functionality for the CAN BCM.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

384317ef

can: complete initial namespace support · cb5635a3

Oliver Hartkopp authored Apr 25, 2017

The statistics and its proc output was not implemented as per-net in the
initial network namespace support by Mario Kicherer (8e8cda6d).
This patch adds the missing per-net statistics for the CAN subsystem.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

cb5635a3

can: remove obsolete definitions · f2e72f43

Oliver Hartkopp authored Apr 25, 2017

can_rx_alldev_list is a per-net data structure now. Remove it's definition
here and can_rx_dev_list too.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

f2e72f43

can: remove obsolete pernet_operations definitions · 48452c16

Oliver Hartkopp authored Apr 25, 2017

The namespace support for the CAN subsystem does not need any additional
memory. So when ".size = 0" there's no extra memory allocated by the system.
And therefore ".id" is obsolete too.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

48452c16

can: fix memory leak in initial namespace support · a7bbd28f

Oliver Hartkopp authored Apr 25, 2017

The can_rx_alldev_list is a per-net data structure now and allocated in
can_pernet_init(). Make sure the memory is free'd in can_pernet_exit() too.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

a7bbd28f

can: mcba_usb: Add support for Microchip CAN BUS Analyzer · 51f3baad

Remigiusz Kołłątaj authored Apr 14, 2017

SocketCAN driver for Microchip CAN BUS Analyzer
(http://www.microchip.com/development-tools/)

Changes in v4:
- possible memory leak fixed in mcba_usb_write_bulk_callback
- LED support added
- failure handling in mcba_usb_probe improved
- C99 initializers for structs on stack

Changes in v3:
- improved/simplified CAN ID conversion
- functions for transmission of skb and cmd separated
- fixed/improved netif_stop_queue handling
- style/cosmetic corrections

Changes in v2:
- Termination handling reimplemented to fit new netlink API
(IFLA_CAN_TERMINATION)
- Bitrate handling reimplemented to fit new netlink API
(IFLA_CAN_BITRATE)
- CAN ID conversion refactored (changed from macro to inline functions)
- CAN DLC handling using get_can_dlc()
- Endianness handling for can_speed introduced
- Debugging removed
- Redundant error prints removed
- Style/cosmetic corrections (i.e. macro names, redefs, inits etc.)
Signed-off-by: Remigiusz Kołłątaj <remigiusz.kollataj@mobica.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

51f3baad

can: m_can: Enable TX FIFO Handling for M_CAN IP version >= v3.1.x · 10c1c397

Mario Huettel authored Apr 08, 2017

* Added defines for TX Event FIFO Element
* Adapted ndo_start_xmit function.
  For versions >= v3.1.x it uses the TX FIFO to optimize the data
  throughput. It stores the echo skb at the same index as in the
  M_CAN's TX FIFO. The frame's message marker is set to this index.
  This message marker is received in the TX Event FIFO after
  the message was successfully transmitted. It is used to echo the
  correct echo skb back to the network stack.
* Added m_can_echo_tx_event function. It reads all received
  message markers in the TX Event FIFO and loops back the
  corresponding echo skbs.
* ISR checks for new TX Event Entry interrupt for version >= 3.1.x.
Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

10c1c397

can: m_can: Configuration for TX and TX event FIFOs · 428479e4

Mario Huettel authored Apr 08, 2017

* TX/TX Event FIFO sizes are configured for version >= v3.1.x
Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

428479e4

can: m_can: Enable M_CAN version dependent initialization · b03cfc5b

Mario Huettel authored Apr 08, 2017

This patch adapts the initialization of the M_CAN. So it can be used
with all versions >= 3.0.x.

Changes:
* Added version element to m_can_priv structure to hold M_CAN version.
* Renamed bittiming structs for version 3.0.x
* Added new bittiming structs for version >= 3.1.x
* Function alloc_m_can_dev takes 2 new arguments. The TX FIFO size and the
  base address of the module.
* Chip configuration for CAN_CTRLMODE_LOOPBACK is changed: Enabled
  CCCR_MON bit. In combination with TEST_LBCK it activates the internal
  loopback mode. Leaving CCCR_MON '0' results in external loopback mode.
* Clocks are temporarily enabled by platform_propbe function in order to
  allow read access to the Core Release register and the Control Register.
  Registers are used to detect M_CAN version and optional Non-ISO Feature.

Initialization of M_CAN for version >= 3.1.x:
* TX FIFO of M_CAN is used to transmit frames. The driver does not need to
  stop the tx queue after each frame sent.
* Initialization of TX Event FIFO is added.
* NON-ISO is fixed for all M_CAN versions < 3.2.x. Version 3.2.x _can_ have
  the NISO (Non-ISO) bit which can switch the mode of the M_CAN to Non-ISO
  mode. This bit does not have to be writeable. Therefore it is checked.
  If it is writable Non-ISO support is added to the controllers supported
  CAN modes.

New Functions:
* Function to check the Core Release version. The read value determines the
  behaviour of the driver.
* Function to check if the NISO bit for version >= 3.2.x is implemented.
Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

b03cfc5b

can: m_can: Updated register defines to newest version · 5e1bd15a

Mario Huettel authored Apr 08, 2017

* Updated register defines to newest M_CAN version (v3.2.1).
* Changed defines in the whole code.
Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

5e1bd15a

can: m_can: Removed virtual address from print · ee8c3f6f

Mario Huettel authored Apr 08, 2017

The virtual address of the device was printed. I removed it because it
leaks internal information.
Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

ee8c3f6f

can: m_can: Removed initialization of FIFO water marks · 8f265895

Mario Huettel authored Apr 08, 2017

FIFO water marks disabled because the driver doesn't handle water mark
events.
Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

8f265895

can: m_can: Disabled Interrupt Line 1 · 52973810

Mario Huettel authored Apr 08, 2017

* Disabled interrupt line 1. The driver didn't use it.
Signed-off-by: Mario Huettel <mario.huettel@gmx.net>
Tested-by: Quentin Schulz <quentin.schulz@free-electrons.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

52973810

can: peak: add support for PEAK PCAN-PCIe FD CAN-FD boards · 8ac8321e

Stephane Grosjean authored Jan 19, 2017

This patch adds the support of the PCAN-PCI Express FD boards made
by PEAK-System, for computers using the PCI Express slot.

The PCAN-PCI Express FD has one or two CAN FD channels, depending
on the model. A galvanic isolation of the CAN ports protects
the electronics of the card and the respective computer against
disturbances of up to 500 Volts. The PCAN-PCI Express FD can be operated
with ambient temperatures in a range of -40 to +85 °C.

Such boards run an extented version of the CAN-FD IP running into USB
CAN-FD interfaces from PEAK-System, so this patch adds several new commands
and their corresponding data types to the PEAK CAN-FD common definitions
header file too.
Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

8ac8321e

can: peak: move header file to new can common subdir · c3df7c57

Stephane Grosjean authored Jan 19, 2017

The CAN-FD IP from PEAK-System runs into several kinds of PC CAN-FD
interfaces. Up to now, only the USB CAN-FD adapters were supported by
the Kernel. In order to prepare the adding of some new non-USB CAN-FD
interfaces, this patch moves - and rename - the IP definitions file
from its private (usb) sub-directory into a - newly created - CAN specific
one.
Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

c3df7c57

can: peak: fix usage of const qualifier in pointers args · 113ab88b

Stephane Grosjean authored Jan 19, 2017

Fixes the usage of the const qualifier in the memory pointer arguments
of the declared inline functions. By changing the line containing "const",
this patch also changes the name of the arg into a more usual one.
Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

113ab88b

can: peak: fix usage of usb specific data type · 81c5e13d

Stephane Grosjean authored Jan 19, 2017

This patch fixes the wrong usage of a specific USB data type into a common
header file. This common header file is intended to define the common data
types and values that define access to the PEAK-System CAN-FD IP, whatever
the PC interface is.
Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

81c5e13d

Merge branch 'virtio-net-tx-napi' · 86a5df14

David S. Miller authored Apr 24, 2017

Willem de Bruijn says:

====================
virtio-net tx napi

Add napi for virtio-net transmit completion processing.

Changes:
  v2 -> v3:
    - convert __netif_tx_trylock to __netif_tx_lock on tx napi poll
          ensure that the handler always cleans, to avoid deadlock
    - unconditionally clean in start_xmit
          avoid adding an unnecessary "if (use_napi)" branch
    - remove virtqueue_disable_cb in patch 5/5
          a noop in the common event_idx based loop
    - document affinity_hint_set constraint

  v1 -> v2:
    - disable by default
    - disable unless affinity_hint_set
          because cache misses add up to a third higher cycle cost,
	  e.g., in TCP_RR tests. This is not limited to the patch
	  that enables tx completion cleaning in rx napi.
    - use trylock to avoid contention between tx and rx napi
    - keep interrupts masked during xmit_more (new patch 5/5)
          this improves cycles especially for multi UDP_STREAM, which
	  does not benefit from cleaning tx completions on rx napi.
    - move free_old_xmit_skbs (new patch 3/5)
          to avoid forward declaration

    not changed:
    - deduplicate virnet_poll_tx and virtnet_poll_txclean
          they look similar, but have differ too much to make it
	  worthwhile.
    - delay netif_wake_subqueue for more than 2 + MAX_SKB_FRAGS
          evaluated, but made no difference
    - patch 1/5

  RFC -> v1:
    - dropped vhost interrupt moderation patch:
          not needed and likely expensive at light load
    - remove tx napi weight
        - always clean all tx completions
        - use boolean to toggle tx-napi, instead
    - only clean tx in rx if tx-napi is enabled
        - then clean tx before rx
    - fix: add missing braces in virtnet_freeze_down
    - testing: add 4KB TCP_RR + UDP test results

Based on previous patchsets by Jason Wang:

  [RFC V7 PATCH 0/7] enable tx interrupts for virtio-net
  http://lkml.iu.edu/hypermail/linux/kernel/1505.3/00245.html

Before commit b0c39dbd ("virtio_net: don't free buffers in xmit
ring") the virtio-net driver would free transmitted packets on
transmission of new packets in ndo_start_xmit and, to catch the edge
case when no new packet is sent, also in a timer at 10HZ.

A timer can cause long stalls. VIRTIO_F_NOTIFY_ON_EMPTY avoids stalls
due to low free descriptor count. It does not address a stalls due to
low socket SO_SNDBUF. Increasing timer frequency decreases that stall
time, but increases interrupt rate and, thus, cycle count.

Currently, with no timer, packets are freed only at ndo_start_xmit.
Latency of consume_skb is now unbounded. To avoid a deadlock if a sock
reaches SO_SNDBUF, packets are orphaned on tx. This breaks TCP small
queues.

Reenable TCP small queues by removing the orphan. Instead of using a
timer, convert the driver to regular tx napi. This does not have the
unresolved stall issue and does not have any frequency to tune.

By keeping interrupts enabled by default, napi increases tx
interrupt rate. VIRTIO_F_EVENT_IDX avoids sending an interrupt if
one is already unacknowledged, so makes this more feasible today.
Combine that with an optimization that brings interrupt rate
back in line with the existing version for most workloads:

Tx completion cleaning on rx interrupts elides most explicit tx
interrupts by relying on the fact that many rx interrupts fire.

Tested by running {1, 10, 100} {TCP, UDP} STREAM, RR, 4K_RR benchmarks
from a guest to a server on the host, on an x86_64 Haswell. The guest
runs 4 vCPUs pinned to 4 cores. vhost and the test server are
pinned to a core each.

All results are the median of 5 runs, with variance well < 10%.
Used neper (github.com/google/neper) as test process.

Napi increases single stream throughput, but increases cycle cost.
The optimizations bring this down. The previous patchset saw a
regression with UDP_STREAM, which does not benefit from cleaning tx
interrupts in rx napi. This regression is now gone for 10x, 100x.
Remaining difference is higher 1x TCP_STREAM, lower 1x UDP_STREAM.

The latest results are with process, rx napi and tx napi affine to
the same core. All numbers are lower than the previous patchset.

             upstream     napi
TCP_STREAM:
1x:
  Mbps          27816    39805
  Gcycles         274      285

10x:
  Mbps          42947    42531
  Gcycles         300      296

100x:
  Mbps          31830    28042
  Gcycles         279      269

TCP_RR Latency (us):
1x:
  p50              21       21
  p99              27       27
  Gcycles         180      167

10x:
  p50              40       39
  p99              52       52
  Gcycles         214      211

100x:
  p50             281      241
  p99             411      337
  Gcycles         218      226

TCP_RR 4K:
1x:
  p50              28       29
  p99              34       36
  Gcycles         177      167

10x:
  p50              70       71
  p99              85      134
  Gcycles         213      214

100x:
  p50             442      611
  p99             802      785
  Gcycles         237      216

UDP_STREAM:
1x:
  Mbps          29468    26800
  Gcycles         284      293

10x:
  Mbps          29891    29978
  Gcycles         285      312

100x:
  Mbps          30269    30304
  Gcycles         318      316

UDP_RR:
1x:
  p50              19       19
  p99              23       23
  Gcycles         180      173

10x:
  p50              35       40
  p99              54       64
  Gcycles         245      237

100x:
  p50             234      286
  p99             484      473
  Gcycles         224      214

Note that GSO is enabled, so 4K RR still translates to one packet
per request.

Lower throughput at 100x vs 10x can be (at least in part)
explained by looking at bytes per packet sent (nstat). It likely
also explains the lower throughput of 1x for some variants.

upstream:

 N=1   bytes/pkt=16581
 N=10  bytes/pkt=61513
 N=100 bytes/pkt=51558

at_rx:

 N=1   bytes/pkt=65204
 N=10  bytes/pkt=65148
 N=100 bytes/pkt=56840
====================
Acked-by: Michael S. Tsirkin <mst@redhat.com>

86a5df14

virtio-net: keep tx interrupts disabled unless kick · bdb12e0d

Willem de Bruijn authored Apr 24, 2017

Tx napi mode increases the rate of transmit interrupts. Suppress some
by masking interrupts while more packets are expected. The interrupts
will be reenabled before the last packet is sent.

This optimization reduces the througput drop with tx napi for
unidirectional flows such as UDP_STREAM that do not benefit from
cleaning tx completions in the the receive napi handler.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bdb12e0d