Commits · c61fb99cae51958a9096d8540c8c05e74cfa7e59 · Kirill Smelkov / linux

07 Feb, 2017 40 commits

bnxt_en: Add RX page mode support. · c61fb99c

Michael Chan authored Feb 06, 2017

This mode is to support XDP.  In this mode, each rx ring is configured
with page sized buffers for linear placement of each packet.  MTU will be
restricted to what the page sized buffers can support.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c61fb99c

bnxt_en: Parameterize RX buffer offsets. · b3dba77c

Michael Chan authored Feb 06, 2017

Convert the global constants BNXT_RX_OFFSET and BNXT_RX_DMA_OFFSET to
device parameters.  This will make it easier to support XDP with
headroom support which requires different RX buffer offsets.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b3dba77c

bnxt_en: Add bp->rx_dir field for rx buffer DMA direction. · 745fc05c

Michael Chan authored Feb 06, 2017

When driver is running in XDP mode, rx buffers are DMA mapped as
DMA_BIDIRECTIONAL.  Add a field so the code will map/unmap rx buffers
according to this field.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

745fc05c

bnxt_en: Don't use DEFINE_DMA_UNMAP_ADDR to store DMA address in RX path. · 11cd119d

Michael Chan authored Feb 06, 2017

To support XDP_TX, we need the RX buffer's DMA address to transmit the
packet.  Convert the DMA address field to a permanent field in
bnxt_sw_rx_bd.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

11cd119d

bnxt_en: Refactor rx SKB function. · 6bb19474

Michael Chan authored Feb 06, 2017

Minor refactoring of bnxt_rx_skb() so that it can easily be replaced by
a new function that handles packets in a single page.  Also, use a
function pointer bp->rx_skb_func() to switch to a new function when
we add the new mode in the next patch.

Add a new field data_ptr that points to the packet data in the
bnxt_sw_rx_bd structure.  The original data field is changed to void
pointer so that it can either hold the kmalloc'ed data or a page
pointer.

The last parameter of bnxt_rx_skb() which was the length parameter is
changed to include the payload offset of the packet in the upper 16 bit.
The offset is needed to support the rx page mode and is not used in
this existing function.

v3: Added a new data_ptr parameter to bp->rx_skb_func().  The caller
has the option to modify the starting address of the packet.  This
will be needed when XDP with headroom support is added.

v2: Changed the name of the last parameter to offset_and_len to make the
code more clear.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6bb19474

net: qcom/emac: add ethool support for setting pause parameters · b44700e9

Timur Tabi authored Feb 06, 2017

To support setting the pause parameters, the driver can no longer just
mirror the PHY.  The set_pauseparam feature allows the driver to
force the setting in the MAC, regardless of how the PHY is configured.
This means that we now need to maintain an internal state for pause
frame support, and so get_pauseparam also needs to be updated.

If the interface is already running when the setting is changed, then
the interface is reset.

Note that if the MAC is configured to enable RX pause frame support
(i.e. it transmits pause frames to throttle the other end), but the
PHY is configured to block those frames, then the feature will not work.

Also some buffer size initialization code into emac_init_adapter(),
so that it lives with similar code, including the initializtion of
pause frame support.
Signed-off-by: Timur Tabi <timur@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

b44700e9

Merge branch 'replace-dst_confirm' · 29ba6e74

David S. Miller authored Feb 07, 2017

Julian Anastasov says:

====================
net: dst_confirm replacement

	This patchset addresses the problem of neighbour
confirmation where received replies from one nexthop
can cause confirmation of different nexthop when using
the same dst. Thanks to YueHaibing <yuehaibing@huawei.com>
for tracking the dst->pending_confirm problem.

	Sockets can obtain cached output route. Such
routes can be to known nexthop (rt_gateway=IP) or to be
used simultaneously for different nexthop IPs by different
subnet prefixes (nh->nh_scope = RT_SCOPE_HOST, rt_gateway=0).

	At first look, there are more problems:

- dst_confirm() sets flag on dst and not on dst->path,
as result, indication is lost when XFRM is used

- DNAT can change the nexthop, so the really used nexthop is
not confirmed

	So, the following solution is to avoid using
dst->pending_confirm.

	The current dst_confirm() usage is as follows:

Protocols confirming dst on received packets:
- TCP (1 dst per socket)
- SCTP (1 dst per transport)
- CXGB*

Protocols supporting sendmsg with MSG_CONFIRM [ | MSG_PROBE ] to
confirm neighbour:
- UDP IPv4/IPv6
- ICMPv4 PING
- RAW IPv4/IPv6
- L2TP/IPv6

MSG_CONFIRM for other purposes (fix not needed):
- CAN

Sending without locking the socket:
- UDP (when no cork)
- RAW (when hdrincl=1)

Redirects from old to new GW:
- rt6_do_redirect

	The patchset includes the following changes:

1. sock: add sk_dst_pending_confirm flag

- used only by TCP with patch 4 to remember the received
indication in sk->sk_dst_pending_confirm

2. net: add dst_pending_confirm flag to skbuff

- skb->dst_pending_confirm will be used by all protocols
in following patches, via skb_{set,get}_dst_pending_confirm

3. sctp: add dst_pending_confirm flag

- SCTP uses per-transport dsts and can not use
sk->sk_dst_pending_confirm like TCP

4. tcp: replace dst_confirm with sk_dst_confirm

5. net: add confirm_neigh method to dst_ops

- IPv4 and IPv6 provision for slow neigh lookups for MSG_PROBE users.
I decided to use neigh lookup only for this case because on
MSG_PROBE the skb may pass MTU checks but it does not reach
the neigh confirmation code. This patch will be used from patch 6.

- xfrm_confirm_neigh: we use the last tunnel address, if present.
When there are only transports, the original dest address is used.

6. net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP

- dst_confirm conversion for UDP, RAW, ICMP and L2TP/IPv6

- these protocols use MSG_CONFIRM propagated by ip*_append_data
to skb->dst_pending_confirm. sk->sk_dst_pending_confirm is not
used because some sending paths do not lock the socket. For
MSG_PROBE we use the slow lookup (dst_confirm_neigh).

- there are also 2 cases that need the slow lookup:
__ip6_rt_update_pmtu and rt6_do_redirect. I hope
&ipv6_hdr(skb)->saddr is the correct nexthop address to use here.

7. net: pending_confirm is not used anymore

- I failed to understand the CXGB* code, I see dst_confirm()
calls but I'm not sure dst_neigh_output() was called. For now
I just removed the dst->pending_confirm flag and left all
dst_confirm() calls there. Any better idea?

- Now may be old function neigh_output() should be restored
instead of dst_neigh_output?
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

29ba6e74

net: pending_confirm is not used anymore · 51ce8bd4

Julian Anastasov authored Feb 06, 2017

When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.
As last step, we can remove the pending_confirm flag.
Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 5110effe ("net: Do delayed neigh confirmation.")
Fixes: f2bb4bed ("ipv4: Cache output routes in fib_info nexthops.")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

51ce8bd4

net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP · 0dec879f

Julian Anastasov authored Feb 06, 2017

When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.

The datagram protocols can use MSG_CONFIRM to confirm the
neighbour. When used with MSG_PROBE we do not reach the
code where neighbour is confirmed, so we have to do the
same slow lookup by using the dst_confirm_neigh() helper.
When MSG_PROBE is not used, ip_append_data/ip6_append_data
will set the skb flag dst_pending_confirm.
Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 5110effe ("net: Do delayed neigh confirmation.")
Fixes: f2bb4bed ("ipv4: Cache output routes in fib_info nexthops.")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0dec879f

net: add confirm_neigh method to dst_ops · 63fca65d

Julian Anastasov authored Feb 06, 2017

Add confirm_neigh method to dst_ops and use it from IPv4 and IPv6
to lookup and confirm the neighbour. Its usage via the new helper
dst_confirm_neigh() should be restricted to MSG_PROBE users for
performance reasons.

For XFRM prefer the last tunnel address, if present. With help
from Steffen Klassert.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

63fca65d

tcp: replace dst_confirm with sk_dst_confirm · c3a2e837

Julian Anastasov authored Feb 06, 2017

When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.
Use the new sk_dst_confirm() helper to propagate the
indication from received packets to sock_confirm_neigh().
Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 5110effe ("net: Do delayed neigh confirmation.")
Fixes: f2bb4bed ("ipv4: Cache output routes in fib_info nexthops.")
Tested-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c3a2e837

sctp: add dst_pending_confirm flag · c86a773c

Julian Anastasov authored Feb 06, 2017

Add new transport flag to allow sockets to confirm neighbour.
When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.
The flag is propagated from transport to every packet.
It is reset when cached dst is reset.
Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 5110effe ("net: Do delayed neigh confirmation.")
Fixes: f2bb4bed ("ipv4: Cache output routes in fib_info nexthops.")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c86a773c

net: add dst_pending_confirm flag to skbuff · 4ff06203

Julian Anastasov authored Feb 06, 2017

Add new skbuff flag to allow protocols to confirm neighbour.
When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.

Add sock_confirm_neigh() helper to confirm the neighbour and
use it for IPv4, IPv6 and VRF before dst_neigh_output.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4ff06203

sock: add sk_dst_pending_confirm flag · 9b8805a3

Julian Anastasov authored Feb 06, 2017

Add new sock flag to allow sockets to confirm neighbour.
When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.
As not all call paths lock the socket use full word for
the flag.

Add sk_dst_confirm as replacement for dst_confirm when
called for received packets.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9b8805a3

net: phy: bcm7xxx: Add BCM74371 PHY ID · b08d46b0

Florian Fainelli authored Feb 06, 2017

Add the BCM74371 PHY ID to the list of supported chips. This is a 28nm
technology Gigabit PHY SoC.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b08d46b0

mlxsw: add psample dependency for spectrum · 8d1fb01d

Arnd Bergmann authored Feb 06, 2017

When PSAMPLE is a loadable module, spectrum must not be built-in:

drivers/net/built-in.o: In function `mlxsw_sp_rx_listener_sample_func':
spectrum.c:(.text+0xe357e): undefined reference to `psample_sample_packet'

This adds a Kconfig dependency to enforce usable configurations.

Fixes: 98d0f7b9 ("mlxsw: spectrum: Add packet sample offloading support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Yotam Gigi <yotamg@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8d1fb01d

ipv6: sr: fix non static symbol warnings · bb4005ba

Wei Yongjun authored Feb 06, 2017

Fixes the following sparse warnings:

net/ipv6/seg6_iptunnel.c:58:5: warning:
 symbol 'nla_put_srh' was not declared. Should it be static?
net/ipv6/seg6_iptunnel.c:238:5: warning:
 symbol 'seg6_input' was not declared. Should it be static?
net/ipv6/seg6_iptunnel.c:254:5: warning:
 symbol 'seg6_output' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bb4005ba

net/sched: act_mirred: remove duplicated include from act_mirred.c · 89d82452

Wei Yongjun authored Feb 06, 2017

Remove duplicated include.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

89d82452

net: wan: slic_ds26522: Remove .owner field for driver · fee40221

Wei Yongjun authored Feb 06, 2017

Remove .owner field if calls are used which set it automatically.

Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fee40221

net: wan: slic_ds26522: Use module_spi_driver to simplify the code · c3afa995

Wei Yongjun authored Feb 06, 2017

module_spi_driver() makes the code simpler by eliminating
boilerplate code.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c3afa995

Merge branch 'dsa2-pdata' · 521613c5

David S. Miller authored Feb 07, 2017

Florian Fainelli says:

====================
net: dsa: Support for pdata in dsa2

This is not exactly new, and was sent before, although back then, I did not
have an user of the pre-declared MDIO board information, but now we do. Note
that I have additional changes queued up to have b53 register platform data for
MIPS bcm47xx and bcm63xx.

Yes I know that we should have the Orion platforms eventually be converted to
Device Tree, but until that happens, I don't want any remaining users of the
old "dsa" platform device (hence the previous DTS submissions for ARM/mvebu)
and, there will be platforms out there that most likely won't never see DT
coming their way (BCM47xx is almost 100% sure, BCM63xx maybe not in a distant
future).

We would probably want the whole series to be merged via David Miller's tree
to simplify things.

Thanks!

Changes in v5:

- dropped changes to drivers/base/ because after more than a month, we cannot
  get any answer from Greg KH

Changes in v4:

- Changed device_find_class() to device_find_in_class_name()
- Added kerneldoc above device_find_in_class_name() to explain what it does
  and the calling convention regarding device reference counts
- Changed dev_to_net_device to device_to_net_device() added comments
  about what it does and the caller conventions regarding reference counts

Changes in v3:

- Tested EPROBE_DEFER from a mockup MDIO/DSA switch driver and everything
  is fine, once the driver finally probes we have access to platform data
  as expected

- added comment above dsa_port_is_valid() that port->name is mandatory
  for platform data cases

- added an extra check in dsa_parse_member() for a NULL pdata pointer

- fixed a bunch of checkpatch errors and warnings

Changes in v2:

- Rebased against latest net-next/master

- Moved dev_find_class() to device_find_class() into drivers/base/core.c

- Moved dev_to_net_device into net/core/dev.c

- Utilize dsa_chip_data directly instead of dsa_platform_data

- Augmented dsa_chip_data to be multi-CPU port ready

Changes from last submission (few months back):

- rebased against latest net-next

- do not introduce dsa2_platform_data which was overkill and was meant to
  allow us to do exaclty the same things with platform data and Device Tree
  we use the existing dsa_platform_data instead

- properly register MDIO devices when the MDIO bus is registered and associate
  platform_data with them

- add a change to the Orion platform code to demonstrate how this can be used

Thank you
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

521613c5

ARM: orion: Register DSA switch as a MDIO device · 575e93f7

Florian Fainelli authored Feb 04, 2017

Utilize the ability to pass board specific MDIO bus information towards a
particular MDIO device thus allowing us to provide the per-port switch layout
to the Marvell 88E6XXX switch driver.

Since we would end-up with conflicting registration paths, do not register the
"dsa" platform device anymore.

Note that the MDIO devices registered by code in net/dsa/dsa2.c does not
parse a dsa_platform_data, but directly take a dsa_chip_data (specific
to a single switch chip), so we update the different call sites to pass
this structure down to orion_ge00_switch_init().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

575e93f7

net: phy: Allow pre-declaration of MDIO devices · 648ea013

Florian Fainelli authored Feb 04, 2017

Allow board support code to collect pre-declarations for MDIO devices by
registering them with mdiobus_register_board_info(). SPI and I2C buses
have a similar feature, we were missing this for MDIO devices, but this
is particularly useful for e.g: MDIO-connected switches which need to
provide their port layout (often board-specific) to a MDIO Ethernet
switch driver.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

648ea013

net: dsa: Add support for platform data · 71e0bbde

Florian Fainelli authored Feb 04, 2017

Allow drivers to use the new DSA API with platform data. Most of the
code in net/dsa/dsa2.c does not rely so much on device_nodes and can get
the same information from platform_data instead.

We purposely do not support distributed configurations with platform
data, so drivers should be providing a pointer to a 'struct
dsa_chip_data' structure if they wish to communicate per-port layout.

Multiple CPUs port could potentially be supported and dsa_chip_data is
extended to receive up to one reference to an upstream network device
per port described by a dsa_chip_data structure.

dsa_dev_to_net_device() increments the network device's reference count,
so we intentionally call dev_put() to be consistent with the DT-enabled
path, until we have a generic notifier based solution.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

71e0bbde

net: dsa: Rename and export dev_to_net_device() · 14b89f36

Florian Fainelli authored Feb 04, 2017

In preparation for using this function in net/dsa/dsa2.c, rename the function
to make its scope DSA specific, and export it.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

14b89f36

net: dsa: mv88e6xxx: Refactor remaining port setup · a23b2961

Andrew Lunn authored Feb 04, 2017

Move the remaining port configuration code which varies per device
into port.c, using ops were necessary. This makes
mv88e6xxx_6185_family() and mv88e6xxx_6095_family() unused, so remove
them.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a23b2961

net: dsa: mv88e6xxx: Implement Clause 45 access to SMI devices · cf3e80df

Andrew Lunn authored Feb 04, 2017

The mv88e6390 MDIO bus controllers can support for clause 45 accesses.
The internal SERDES interfaces need this, and it is likely external
10GHz PHYs will be clause 45.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cf3e80df

Merge branch 'mv88e6390-CMODE' · 8661a631

David S. Miller authored Feb 07, 2017

Andrew Lunn says:

====================
Set the CMODE for mv88e6390 ports

The mv88e6390 ports 9 & 10 allow there CMODE to be set. CMODE is part
of what linux defines as phy-mode. Add the needed phy-modes to linux,
and add code which will act upon the phy-mode property to configure
the switch port.

These patches have been posted before as part of a bigger patchset
which has now been broken up. I've added the received reviewed by
tags, and added device tree documentation.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8661a631

net: dsa: mv88e6xxx: Set the CMODE for mv88e6390 ports 9 & 10 · f39908d3

Andrew Lunn authored Feb 04, 2017

Unlike most ports, ports 9 and 10 of the 6390X family have configurable
PHY modes. Set the mode as part of adjust_link().

Ordering is important, because the SERDES interfaces connected to
ports 9 and 10 can be split and assigned to other ports. The CMODE has
to be correctly set before the SERDES interface on another port can be
configured. Such configuration is likely to be performed in
port_enable() and port_disabled(), called on slave_open() and
slave_close().

The simple case is port 9 and 10 are used for 'CPU' or 'DSA'. In this
case, the CMODE is set via a phy-mode in dsa_cpu_dsa_setup(), which is
called early in the switch setup.

When ports 9 or 10 are used as user ports, and have a fixed-phy, when
the fixed fixed-phy is attached, dsa_slave_adjust_link() is called,
which results in the adjust_link function being called, setting the
cmode. The port_enable() will for other ports will be called much
later.

When ports 9 or 10 are used as user ports and have a real phy attached
which does not use all the available SERDES interface, e.g. a 1Gbps
SGMII, there is currently no mechanism in place to set the CMODE of
the port from software. It must be hoped the stripping resistors are
correct.

At the same time, add a function to get the cmode. This will be needed
when configuring the SERDES interfaces.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f39908d3

net: phy: Add 2000base-x, 2500base-x and rxaui modes · 55601a88

Andrew Lunn authored Feb 04, 2017

The mv88e6390 ports 9 and 10 supports some additional PHY modes. Add
these modes to the PHY core so they can be used in the binding.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

55601a88

Merge branch 'virtio_net-XDP-adjust_head' · 108d9c71

David S. Miller authored Feb 07, 2017

John Fastabend says:

====================
XDP adjust head support for virtio

This series adds adjust head support for virtio. The following is my
test setup. I use qemu + virtio as follows,

./x86_64-softmmu/qemu-system-x86_64 \
  -hda /var/lib/libvirt/images/Fedora-test0.img \
  -m 4096  -enable-kvm -smp 2 -netdev tap,id=hn0,queues=4,vhost=on \
  -device virtio-net-pci,netdev=hn0,mq=on,guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off,vectors=9

In order to use XDP with virtio until LRO is supported TSO must be
turned off in the host. The important fields in the above command line
are the following,

  guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off

Also note it is possible to conusme more queues than can be supported
because when XDP is enabled for retransmit XDP attempts to use a queue
per cpu. My standard queue count is 'queues=4'.

After loading the VM I run the relevant XDP test programs in,

  ./sammples/bpf

For this series I tested xdp1, xdp2, and xdp_tx_iptunnel. I usually test
with iperf (-d option to get bidirectional traffic), ping, and pktgen.
I also have a modified xdp1 that returns XDP_PASS on any packet to ensure
the normal traffic path to the stack continues to work with XDP loaded.

It would be great to automate this soon. At the moment I do it by hand
which is starting to get tedious.

v2: original series dropped trace points after merge.
====================
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

108d9c71

virtio_net: XDP support for adjust_head · 2de2f7f4

John Fastabend authored Feb 02, 2017

Add support for XDP adjust head by allocating a 256B header region
that XDP programs can grow into. This is only enabled when a XDP
program is loaded.

In order to ensure that we do not have to unwind queue headroom push
queue setup below bpf_prog_add. It reads better to do a prog ref
unwind vs another queue setup call.

At the moment this code must do a full reset to ensure old buffers
without headroom on program add or with headroom on program removal
are not used incorrectly in the datapath. Ideally we would only
have to disable/enable the RX queues being updated but there is no
API to do this at the moment in virtio so use the big hammer. In
practice it is likely not that big of a problem as this will only
happen when XDP is enabled/disabled changing programs does not
require the reset. There is some risk that the driver may either
have an allocation failure or for some reason fail to correctly
negotiate with the underlying backend in this case the driver will
be left uninitialized. I have not seen this ever happen on my test
systems and for what its worth this same failure case can occur
from probe and other contexts in virtio framework.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2de2f7f4

virtio_net: refactor freeze/restore logic into virtnet reset logic · 9fe7bfce

John Fastabend authored Feb 02, 2017

For XDP we will need to reset the queues to allow for buffer headroom
to be configured. In order to do this we need to essentially run the
freeze()/restore() code path. Unfortunately the locking requirements
between the freeze/restore and reset paths are different however so
we can not simply reuse the code.

This patch refactors the code path and adds a reset helper routine.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9fe7bfce

virtio_net: remove duplicate queue pair binding in XDP · 722d8283

John Fastabend authored Feb 02, 2017

Factor out qp assignment.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

722d8283

virtio_net: factor out xdp handler for readability · 0354e4d1

John Fastabend authored Feb 02, 2017

At this point the do_xdp_prog is mostly if/else branches handling
the different modes of virtio_net. So remove it and handle running
the program in the per mode handlers.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0354e4d1

virtio_net: wrap rtnl_lock in test for calling with lock already held · 47315329

John Fastabend authored Feb 02, 2017

For XDP use case and to allow ethtool reset tests it is useful to be
able to use reset paths from contexts where rtnl lock is already
held.

This requries updating virtnet_set_queues and free_receive_bufs the
two places where rtnl_lock is taken in virtio_net. To do this we
use the following pattern,

	_foo(...) { do stuff }
	foo(...) { rtnl_lock(); _foo(...); rtnl_unlock()};

this allows us to use freeze()/restore() flow from both contexts.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

47315329

Merge branch 'bridge-improve-cache-utilization' · 152bff37

David S. Miller authored Feb 06, 2017

Nikolay Aleksandrov says:

====================
bridge: improve cache utilization

This is the first set which begins to deal with the bad bridge cache
access patterns. The first patch rearranges the bridge and port structs
a little so the frequently (and closely) accessed members are in the same
cache line. The second patch then moves the garbage collection to a
workqueue trying to improve system responsiveness under load (many fdbs)
and more importantly removes the need to check if the matched entry is
expired in __br_fdb_get which was a major source of false-sharing.
The third patch is a preparation for the final one which
If properly configured, i.e. ports bound to CPUs (thus updating "updated"
locally) then the bridge's HitM goes from 100% to 0%, but even without
binding we get a win because previously every lookup that iterated over
the hash chain caused false-sharing due to the first cache line being
used for both mac/vid and used/updated fields.

Some results from tests I've run:
(note that these were run in good conditions for the baseline, everything
 ran on a single NUMA node and there were only 3 fdbs)

1. baseline
100% Load HitM on the fdbs (between everyone who has done lookups and hit
                            one of the 3 hash chains of the communicating
                            src/dst fdbs)
Overall 5.06% Load HitM for the bridge, first place in the list

2. patched & ports bound to CPUs
0% Local load HitM, bridge is not even in the c2c report list
Also there's 3% consistent improvement in netperf tests.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

152bff37

bridge: fdb: write to used and updated at most once per jiffy · 83a718d6

Nikolay Aleksandrov authored Feb 04, 2017

Writing once per jiffy is enough to limit the bridge's false sharing.
After this change the bridge doesn't show up in the local load HitM stats.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

83a718d6

bridge: move write-heavy fdb members in their own cache line · 1214628c

Nikolay Aleksandrov authored Feb 04, 2017

Fdb's used and updated fields are written to on every packet forward and
packet receive respectively. Thus if we are receiving packets from a
particular fdb, they'll cause false-sharing with everyone who has looked
it up (even if it didn't match, since mac/vid share cache line!). The
"used" field is even worse since it is updated on every packet forward
to that fdb, thus the standard config where X ports use a single gateway
results in 100% fdb false-sharing. Note that this patch does not prevent
the last scenario, but it makes it better for other bridge participants
which are not using that fdb (and are only doing lookups over it).
The point is with this move we make sure that only communicating parties
get the false-sharing, in a later patch we'll show how to avoid that too.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1214628c

bridge: move to workqueue gc · f7cdee8a

Nikolay Aleksandrov authored Feb 04, 2017

Move the fdb garbage collector to a workqueue which fires at least 10
milliseconds apart and cleans chain by chain allowing for other tasks
to run in the meantime. When having thousands of fdbs the system is much
more responsive. Most importantly remove the need to check if the
matched entry has expired in __br_fdb_get that causes false-sharing and
is completely unnecessary if we cleanup entries, at worst we'll get 10ms
of traffic for that entry before it gets deleted.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f7cdee8a