Commits · 6857a02af5386e9f5d11734363741dbe6b0a6959 · Kirill Smelkov / linux

16 Dec, 2015 16 commits

sctp: use GFP_KERNEL in sctp_init() · 6857a02a

Eric Dumazet authored Dec 15, 2015

Since module init functions are called from process context, we had
better use GFP_KERNEL allocations to increase our chances of getting
the high-order pages we want for SCTP hash tables.

This mostly matters if the SCTP module is loaded after memory has
become fragmented.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Merge branch 'sock-diag-destroy' · 5cfe6d8a

David S. Miller authored Dec 15, 2015

Lorenzo Colitti says:

====================
Support administratively closing application sockets

This patchset adds the ability to administratively close a socket
without any action from the process owning the socket or the
socket protocol.

It implements this by adding a new diag_destroy function pointer
to struct proto. In-kernel callers can access this functionality
directly by calling sk->sk_prot->diag_destroy(sk, err).

It also exposes this functionality to userspace via a new
SOCK_DESTROY operation in the NETLINK_SOCK_DIAG sockets. This
allows a privileged userspace process, such as a connection
manager or system administration tool, to close sockets belonging
to other apps when the network they were established on has
disconnected. It is needed on laptops and mobile hosts to ensure
that network switches / disconnects do not result in applications
being blocked for long periods of time (minutes) in read or
connect calls on TCP sockets that will never succeed because the
IP address they are bound to is no longer on the system. Closing
the sockets causes these calls to fail fast and allows the apps
to reconnect on another network.

Userspace intervention is necessary because in many cases the
kernel does not have enough information to know that a connection
is now inoperable. The kernel can know if a packet can't be
routed, but in general it won't know if a TCP connection is stuck
because it is now routed to a network where its source address is
no longer valid [5][6].

Many other operating systems offer similar functionality:

 - FreeBSD has had this since 5.4 in 2005 [2]. It is available
   to privileged userspace and there is a tool to use it [3].
 - The FreeBSD commit description states that the idea came
   from OpenBSD.
 - iOS has been administratively closing app sockets since
   iOS 4 - see [4], which states that a socket "might get
   reclaimed by the kernel" and after that will return EBADF.
 - For many years Android kernels have supported an out-of-tree
   SIOCKILLADDR ioctl that is called when a network disconnects
   or an RTM_DELADDR event is received. This solution is cleaner,
   more robust and more flexible. The connection manager can
   implement SIOCKILLADDR by iterating over all connections on
   the deleted IP address and closing all of them, but it can also
   close all sockets opened by a given app process (for example
   if the user has restricted that app from using the network),
   or close all of an application's TCP connections if a secure
   network such as a VPN has connected and security policy
   requires all of that application's connections to be routed
   via the VPN, etc.

Alternative schemes such as TCP keepalives in combination with
"iptables -j REJECT --reject-with tcp-reset", could be used to
achieve similar results, but on mobile devices TCP keepalives are
very expensive, and in such a scheme detecting stuck connections
has to wait for a keepalive to be sent or the application to
perform a write. An explicit notification from userspace is
cheaper and faster in the common case where an application is
blocked on read.

SOCK_DESTROY is placed behind an INET_DIAG_DESTROY configuration
option, which is currently off by default.

The TCP implementation of diag_destroy causes a TCP ABORT as
specified by RFC 793 [1]: immediately send a RST and clear local
connection state. This is what happens today if an application
enables SO_LINGER with a timeout of 0 and then calls close.
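
As an illustration of the SO_LINGER comparison above, here is a minimal
userspace sketch. It is not part of the patchset. It assumes a Linux host
with a working loopback interface, and the helper name
abortive_close_errno is made up for this example.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: connect to a loopback listener, abort the client
 * side with SO_LINGER {on, 0} followed by close(), and return the errno
 * that the peer's read() reports. Returns 0 if read() did not fail and
 * -1 if the sockets could not be set up.
 */
int abortive_close_errno(void)
{
	struct linger lg = { .l_onoff = 1, .l_linger = 0 };
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int lfd = -1, cfd = -1, sfd = -1, ret = -1;
	char byte;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* let the kernel pick a free port */

	lfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd < 0 ||
	    bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(lfd, 1) < 0 ||
	    getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
		goto out;

	cfd = socket(AF_INET, SOCK_STREAM, 0);
	if (cfd < 0 || connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;
	sfd = accept(lfd, NULL, NULL);
	if (sfd < 0)
		goto out;

	/* Linger enabled with a zero timeout: close() aborts with a RST. */
	if (setsockopt(cfd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg)) < 0)
		goto out;
	close(cfd);
	cfd = -1;

	/* The peer's blocking read fails once the RST arrives. */
	ret = read(sfd, &byte, 1) < 0 ? errno : 0;
out:
	if (cfd >= 0)
		close(cfd);
	if (sfd >= 0)
		close(sfd);
	if (lfd >= 0)
		close(lfd);
	return ret;
}
```

On Linux the peer normally sees ECONNRESET here, which is the visible
effect of the RST that an ABORT sends.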

The first versions of the patchset did not send a RST, but that
is not graceful/correct TCP behaviour. tcp_abort now does a
proper RFC 793 ABORT and sends a RST to the peer. This is
consistent with BSD's tcpdrop, and is more correct in general,
even though in many use cases tcp_abort will only be called when
sending a RST is no longer possible (e.g., the network has
disconnected).

The original patchset also behaved like SIOCKILLADDR and closed
TCP sockets with ETIMEDOUT. Tom Herbert pointed out that it would
be better if applications could distinguish between a timeout and
an administrative close. ECONNABORTED was chosen because it is
consistent with BSD.

[1] http://tools.ietf.org/html/rfc793#page-50
[2] http://svnweb.freebsd.org/base?view=revision&revision=141381
[3] https://www.freebsd.org/cgi/man.cgi?query=tcpdrop&sektion=8&manpath=FreeBSD+5.4-RELEASE
[4] https://developer.apple.com/library/ios/technotes/tn2277/_index.html#//apple_ref/doc/uid/DTS40010841-CH1-SUBSECTION3
[5] http://www.spinics.net/lists/netdev/msg352775.html
[6] http://www.spinics.net/lists/netdev/msg352952.html
====================
Signed-off-by: David S. Miller <davem@davemloft.net>


net: diag: Support destroying TCP sockets. · c1e64e29

Lorenzo Colitti authored Dec 16, 2015

This implements SOCK_DESTROY for TCP sockets. It causes all
blocking calls on the socket to fail fast with ECONNABORTED and
causes a protocol close of the socket. It informs the other end
of the connection by sending a RST, i.e., initiating a TCP ABORT
as per RFC 793. ECONNABORTED was chosen for consistency with
FreeBSD.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
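
For readers who want to try the operation from userspace, the following
is a rough sketch of how a SOCK_DESTROY request for one IPv4 TCP
connection could be built and sent over a NETLINK_SOCK_DIAG socket. It
assumes the uapi headers linux/sock_diag.h and linux/inet_diag.h, a
kernel built with INET_DIAG_DESTROY, and a privileged caller. The names
destroy_req, build_destroy_req and send_destroy_req are hypothetical
and do not come from the patch.

```c
#include <errno.h>
#include <linux/inet_diag.h>
#include <linux/netlink.h>
#include <linux/sock_diag.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOCK_DESTROY
#define SOCK_DESTROY 21	/* fallback for older uapi headers */
#endif

/* Netlink header followed by the inet_diag request naming the socket. */
struct destroy_req {
	struct nlmsghdr nlh;
	struct inet_diag_req_v2 r;
};

/* Fill in a SOCK_DESTROY request for one IPv4 TCP connection. */
void build_destroy_req(struct destroy_req *req,
		       const struct sockaddr_in *src,
		       const struct sockaddr_in *dst)
{
	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = sizeof(*req);
	req->nlh.nlmsg_type = SOCK_DESTROY;
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	req->r.sdiag_family = AF_INET;
	req->r.sdiag_protocol = IPPROTO_TCP;
	req->r.id.idiag_sport = src->sin_port;	/* network byte order */
	req->r.id.idiag_dport = dst->sin_port;
	req->r.id.idiag_src[0] = src->sin_addr.s_addr;
	req->r.id.idiag_dst[0] = dst->sin_addr.s_addr;
	req->r.id.idiag_cookie[0] = INET_DIAG_NOCOOKIE;	/* match any cookie */
	req->r.id.idiag_cookie[1] = INET_DIAG_NOCOOKIE;
}

/* Send the request; returns 0 or a negative errno from the kernel's ack. */
int send_destroy_req(const struct destroy_req *req)
{
	struct {
		struct nlmsghdr nlh;
		struct nlmsgerr err;
		char rest[256];	/* room for the echoed request */
	} ack;
	ssize_t n;
	int fd, ret;

	fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_SOCK_DIAG);
	if (fd < 0)
		return -errno;
	if (send(fd, req, req->nlh.nlmsg_len, 0) < 0) {
		ret = -errno;
	} else if ((n = recv(fd, &ack, sizeof(ack), 0)) < 0) {
		ret = -errno;
	} else if ((size_t)n >= NLMSG_LENGTH(sizeof(struct nlmsgerr)) &&
		   ack.nlh.nlmsg_type == NLMSG_ERROR) {
		ret = ack.err.error;	/* 0 on success, negative errno otherwise */
	} else {
		ret = -EPROTO;
	}
	close(fd);
	return ret;
}
```

A caller would fill two sockaddr_in values with the connection's local
and remote endpoints, call build_destroy_req, then send_destroy_req.
Blocking calls on the destroyed socket should then fail with
ECONNABORTED, as described in the commit message.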


net: diag: Support SOCK_DESTROY for inet sockets. · 6eb5d2e0

Lorenzo Colitti authored Dec 16, 2015

This passes the SOCK_DESTROY operation to the underlying protocol
diag handler, or returns -EOPNOTSUPP if that handler does not
define a destroy operation.

Most of this patch is just renaming functions. This is not
strictly necessary, but it would be fairly counterintuitive to
have the code to destroy inet sockets be in a function whose name
starts with inet_diag_get.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
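
The dispatch rule in this commit (call the protocol handler's destroy
operation, or return -EOPNOTSUPP when none is defined) can be modelled
in a few lines of userspace C. This is only a sketch of the pattern:
the model_* types and functions are hypothetical stand-ins, not the
kernel's structures.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket. */
struct model_sock {
	int closed;
	int err;
};

/* Hypothetical per-protocol diag handler; destroy may be NULL. */
struct model_diag_handler {
	int (*destroy)(struct model_sock *sk, int err);
};

/* A protocol that implements destroy: record the error and mark closed. */
int model_tcp_destroy(struct model_sock *sk, int err)
{
	sk->err = err;
	sk->closed = 1;
	return 0;
}

/* Pass the request to the handler, or report that it is unsupported. */
int model_sock_destroy(const struct model_diag_handler *h,
		       struct model_sock *sk, int err)
{
	if (!h || !h->destroy)
		return -EOPNOTSUPP;
	return h->destroy(sk, err);
}
```

A handler that leaves destroy unset gets -EOPNOTSUPP without touching
the socket, so protocols can opt in one at a time.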


net: diag: Add the ability to destroy a socket. · 64be0aed

Lorenzo Colitti authored Dec 16, 2015

This patch adds a SOCK_DESTROY operation, a destroy function
pointer to sock_diag_handler, and a diag_destroy function
pointer.  It does not include any implementation code.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


net: diag: split inet_diag_dump_one_icsk into two · b613f56e

Lorenzo Colitti authored Dec 16, 2015

Currently, inet_diag_dump_one_icsk finds a socket and then dumps
its information to userspace. Split it into a part that finds the
socket and a part that dumps the information.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Merge branch 'ila-early-demux' · fec65bd4

David S. Miller authored Dec 15, 2015

Tom Herbert says:

====================
ila: Optimization to preserve value of early demux

In the current implementation of ILA, LWT is used to perform
translation on both the input and output paths. This is functional,
however there is a big performance hit in the receive path. Early
demux occurs before the routing lookup (a hit actually obviates the
route lookup). Therefore the stack currently performs early
demux before translation so that a local connection with ILA
addresses is never matched. Note that this issue is not just
with ILA, but pretty much any translated or encapsulated packet
handled by LWT would miss the opportunity for early demux. Solving
the general problem seems non-trivial since we would need to move
the route lookup before early demux, thereby mitigating the value.

This patch set addresses the issue for ILA by adding a fast locator
lookup that occurs before early demux. This is done by hooking into
NF_INET_PRE_ROUTING.

For the backend we implement an rhashtable that contains identifier
to locator mappings. The table also allows more specific matches
that include original locator and interface.

This patch set:
 - Add an rhashtable function to atomically replace an element.
   This is useful to implement sub-trees from a table entry
   without needing to use a special anchor structure as the
   table entry.
 - Add a start callback for starting a netlink dump.
 - Create an ila directory under net/ipv6 and move ila.c to it.
   ila.c is split into ila_common.c and ila_lwt.c.
 - Implement a table to do identifier->locator mapping. This is
   an rhashtable (in ila_xlat.c).
 - Configuration for the table with netlink.
 - Add a hook into NF_INET_PRE_ROUTING to perform ILA translation
   before early demux.

Changes in v2:
 - Use iptables targets instead of a new xfrm function

Changes in v3:
 - Add __rcu to next pointer in struct ila_map

Changes in v4:
 - Use hook for NF_INET_PRE_ROUTING

Changed in v5:
 - Register hooks per namespace using nf_register_net_hooks
 - Only register hooks when first mapping is actually added

Changed in v6:
  - Remove gfp argument in alloc_ila_locks, it is unnecessary
  - Set registered_hooks properly when hooks are registered

Testing:
   Running 200 netperf TCP_RR streams

No ILA, baseline
   79.26% CPU utilization
   1678282 tps
   104/189/390 50/90/99% latencies

ILA before fix (LWT on both input and output)
   81.91% CPU utilization
   1464723 tps (-14.5% from baseline)
   121/215/411 50/90/99% latencies

ILA after fix
   80.62% CPU utilization
   1622985 tps (-3.4% from baseline)
   110/191/347 50/90/99% latencies
====================
Signed-off-by: David S. Miller <davem@davemloft.net>


ila: Add generic ILA translation facility · 7f00feaf

Tom Herbert authored Dec 15, 2015

This patch implements an ILA translation table. This table can be
configured with identifier to locator mappings, and can be queried
to resolve a mapping. Queries can be parameterized based on interface,
direction (incoming or outgoing), and matching locator. The table is
implemented using rhashtable and is configured via netlink (through
"ip ila .." in iproute).

The table may be used as an alternative means to do ILA translations
other than the lw tunnels.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


netlink: add a start callback for starting a netlink dump · fc9e50f5

Tom Herbert authored Dec 15, 2015

The start callback allows the caller to set up a context for the
dump callbacks. Presumably, the context can then be destroyed in
the done callback.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
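
The start/dump/done lifecycle described here can be sketched as a small
userspace model. Everything below is hypothetical (the model_* names,
the return conventions, the example callbacks); it only illustrates how
a start callback pairs with done to own a per-dump context.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-dump callback state. */
struct model_cb {
	void *ctx;	/* context set up by start, torn down by done */
};

struct model_dump_ops {
	int (*start)(struct model_cb *cb);	/* optional */
	int (*dump)(struct model_cb *cb);	/* > 0 while data remains */
	int (*done)(struct model_cb *cb);	/* optional */
};

/* Run one dump: start once, dump until finished, then done once. */
int model_run_dump(const struct model_dump_ops *ops, struct model_cb *cb)
{
	int ret, batches = 0;

	if (ops->start) {
		ret = ops->start(cb);
		if (ret)
			return ret;
	}
	while ((ret = ops->dump(cb)) > 0)
		batches++;
	if (ops->done)
		ops->done(cb);
	return ret < 0 ? ret : batches;
}

/* Example callbacks: the context is a counter of batches left to emit. */
int example_start(struct model_cb *cb)
{
	int *remaining = malloc(sizeof(*remaining));

	if (!remaining)
		return -ENOMEM;
	*remaining = 3;
	cb->ctx = remaining;
	return 0;
}

int example_dump(struct model_cb *cb)
{
	int *remaining = cb->ctx;

	if (*remaining == 0)
		return 0;
	(*remaining)--;
	return 1;
}

int example_done(struct model_cb *cb)
{
	free(cb->ctx);
	cb->ctx = NULL;
	return 0;
}
```

With these example callbacks, model_run_dump reports three batches and
leaves no context behind, because done releases what start allocated.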


rhashtable: add function to replace an element · 3502cad7

Tom Herbert authored Dec 15, 2015

Add the rhashtable_replace_fast function. This replaces one object in
the table with another atomically. The hashes of the new and old objects
must be equal.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
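
To make the equal-hash requirement concrete, here is a single-threaded
userspace model of replacing an object in a chained hash table. It is
not the rhashtable implementation: the model_* names, the string keys
and the hash function are assumptions chosen for illustration, and the
kernel version relies on RCU rather than a plain pointer store.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_BUCKETS 8

struct model_obj {
	const char *key;
	int value;
	struct model_obj *next;
};

struct model_table {
	struct model_obj *bucket[MODEL_BUCKETS];
};

static unsigned int model_hash(const char *key)
{
	unsigned int h = 5381;

	while (*key)
		h = h * 33 + (unsigned char)*key++;
	return h % MODEL_BUCKETS;
}

void model_insert(struct model_table *t, struct model_obj *obj)
{
	unsigned int b = model_hash(obj->key);

	obj->next = t->bucket[b];
	t->bucket[b] = obj;
}

struct model_obj *model_lookup(struct model_table *t, const char *key)
{
	struct model_obj *o;

	for (o = t->bucket[model_hash(key)]; o; o = o->next)
		if (strcmp(o->key, key) == 0)
			return o;
	return NULL;
}

/*
 * Swap old_obj for new_obj in place. Returns -EINVAL if the two objects
 * hash to different buckets and -ENOENT if old_obj is not in the table.
 */
int model_replace(struct model_table *t, struct model_obj *old_obj,
		  struct model_obj *new_obj)
{
	unsigned int b = model_hash(old_obj->key);
	struct model_obj **pp;

	if (model_hash(new_obj->key) != b)
		return -EINVAL;
	for (pp = &t->bucket[b]; *pp; pp = &(*pp)->next) {
		if (*pp == old_obj) {
			new_obj->next = old_obj->next;
			*pp = new_obj;	/* one store makes the new object visible */
			return 0;
		}
	}
	return -ENOENT;
}
```

Because the new object takes over the old object's position in the same
bucket chain, a lookup finds either the old or the new object and never
an empty slot. That is only possible when both hash to the same bucket.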


ila: Create net/ipv6/ila directory · 33f11d16

Tom Herbert authored Dec 15, 2015

Create ila directory in preparation for supporting other hooks in the
kernel than LWT for doing ILA. This includes:
  - Moving ila.c to ila/ila_lwt.c
  - Splitting out some common functions into ila_common.c
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Merge branch 'stmmac-mdio-compat' · 3026043d

David S. Miller authored Dec 15, 2015

Phil Reid says:

====================
stmmac: create of compatible mdio bus for stmmac driver

Provide ability to specify a fixed phy in the device tree and
retain the mdio bus if no phy is found. This is needed where
a DSA switch is connected via a fixed phy and uses the mdio bus for config.
Fixed ptp ref clock calculations for the stmmac when ptp ref clock
is running at <= 50MHz. Also add device tree setting to config
ptp clk source on socfpga platforms.

Changes from V5:
- Restore behaviour of unregister mdio bus when no phys found
  if there is no device tree node create the bus.
- Modify condition to allocate mdio_base_data conditional
  on fixed phy presence as well. Maintains existing behaviour
  in conditions where a fixed phy is not present.

Changes from V4:
- Restore #ifdef CONFIG_OF around setting of reset_gpio.
  Member doesn't exist when this isn't defined.

Changes from V3:
- Use if (IS_ENABLED(CONFIG_OF)) instead of #if.
  Reorder some code to reduce if statements.
- of_mdiobus_register already falls back to mdiobus_register
- Tested on system with CONFIG_OF

Changes from V2:
- Formatting, spaces & lines > 80 chars. Using checkpatch
- Drop PTP register debugfs patch.

Changes from V1:
- Fixed mismatch doc / code for ptp_ref_clk dt node.
- Remove unit address from doc example.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>


stmmac: socfpga: Provide dt node to config ptp clk source. · 43569814

Phil Reid authored Dec 14, 2015

Provides an option to use the ptp clock routed from the Altera FPGA
fabric instead of the default eosc1 clock connected to the ARM HPS core.
This setting affects all emacs in the core as the ptp clock is common.
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Phil Reid <preid@electromag.com.au>
Acked-by: Dinh Nguyen <dinguyen@opensource.altera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


stmmac: Fix calculations for ptp counters when clock input = 50Mhz. · 19d857c9

Phil Reid authored Dec 14, 2015

stmmac_config_sub_second_increment sets the sub second increment to 20ns.
The driver is configured to use the fine adjustment method, where the sub
second register is incremented when the accumulator, incremented by the
addend register, overflows. This accumulator is updated on every ptp clk
cycle. If a ptp clk with a period of greater than 20ns was used, the
sub second register would not get updated correctly.

Instead set the sub sec increment to twice the period of the ptp clk.
This results in the addend register being set mid range and overflowing
the accumulator every 2 clock cycles.
Signed-off-by: Phil Reid <preid@electromag.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
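
The arithmetic behind this fix can be checked with a short sketch. It
assumes a 32-bit accumulator and addend register, and that the addend is
chosen so the accumulator overflows once per sub-second increment. The
function names are hypothetical and the code is not taken from the
driver.

```c
#include <stdint.h>

/* Sub-second increment in ns: twice the ptp clock period. */
uint32_t model_sub_second_inc(uint32_t ptp_clk_hz)
{
	return (uint32_t)(2000000000ULL / ptp_clk_hz);
}

/*
 * Addend needed for the accumulator to overflow once per increment:
 *   addend = 2^32 * (1e9 / inc_ns) / ptp_clk_hz
 * Returned as 64 bits so values that do not fit in 32 bits stay visible.
 */
uint64_t model_addend(uint32_t inc_ns, uint32_t ptp_clk_hz)
{
	uint64_t overflow_hz = 1000000000ULL / inc_ns;

	return (overflow_hz << 32) / ptp_clk_hz;
}
```

For a 50 MHz clock the increment becomes 40ns and the addend is
0x80000000, which is mid range. With the old fixed 20ns increment the
same clock would need an addend of 2^32, which does not fit in a 32-bit
register under the assumptions above.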


stmmac: Correct documentation on stmmac clocks. · bf171f01

Phil Reid authored Dec 14, 2015

devm_clk_get looks in the clock-names property for a matching clock.
The ptp_ref_clk property is ignored.
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Phil Reid <preid@electromag.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


stmmac: create of compatible mdio bus for stmmac driver · e34d6569

Phil Reid authored Dec 14, 2015

The DSA driver needs to be passed a reference to an mdio bus. Typically
the mac is configured to use a fixed link but the mdio bus still needs
to be registered so that it can configure the switch.
This patch follows the same process as the altera tse ethernet driver for
creation of the mdio bus.
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Phil Reid <preid@electromag.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>


15 Dec, 2015 24 commits

Merge branch 'end-of-ip-csum' · 93d085d2

David S. Miller authored Dec 15, 2015

Tom Herbert says:

====================
net: The beginning of the end for NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM

Background:

This patch set starts to address one front in the battle against
protocol ossification. Protocol ossification describes the state
that we have arrived at in the evolution of the Internet where we are
materially limited to only using a very narrow range of protocols
and protocol features. For instance, only TCP and UDP are sufficiently
supported on the Internet, so deploying alternative protocols
such as SCTP and DCCP is a non-starter. Similarly, IP options and IPv6
extension headers are typically not considered feasible for wide
deployment, so we have lost the extensibility of IP protocols.

Protocol ossification is not only a problem on the Internet, but in
the data center as well. A root cause of this seems to be narrow,
protocol specific optimizations implemented in switches (for doing
ECMP) and in NICs (NIC offloads). These tend to be performance
optimizations around TCP and UDP packets, and these have become
requirements to implement performant network solutions at scale.

Attempts to deal with protocol ossification in data center have yielded
ad hoc, sub-optimal solutions. A main driver of foo-over-UDP (e.g.
GRE/UDP, MPLS/UDP) is to leverage the existing ECMP and RSS support for
UDP by setting the source port as an entropy value. This has seen some
success, but the cost of additional overhead and layering limits its
usefulness.  An even more extreme solution is STT where non-TCP packets
are spoofed as TCP to leverage NIC offloads.

This patch set endeavours to address protocol ossification caused by
techniques used in transmit checksum offload for NICs. Future work
will address protocol ossification in the other primary NIC offloads--
namely receive checksum offload, LSO, LRO, and RSS.

NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM:

NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM exemplify the problem of protocol
ossification. These features are relics from a simpler time in the
Internet, before encapsulation, before GRE and IPIP. Many hardware
vendors only saw the need to provide checksum offload for simple UDP and
TCP packets over IPv4 (IPv6 support was also an afterthought). In today's
Internet and data centers, checksum offload is well established as a
valuable feature, but we can no longer afford to be constrained to
use a handful of protocols and features that are supported at the
discretion of NIC vendors. Generic and protocol agnostic methods are
needed.

The actual interface that the stack uses with drivers for checksum
offload is CHECKSUM_PARTIAL. This is a generic and protocol agnostic
interface. A driver for a device that supports this generic
interface advertises NETIF_F_HW_CSUM.

Goals of this patch set:

We propose that drivers advertise NETIF_F_HW_CSUM instead of protocol
specific values of NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM.  If the
driver's device is constrained (for instance it can only offload simple
IPv4 and IPv6 packets) then these constraints can be checked in the
transmit path and skb_checksum_help would be called for packets that the
driver is unable to offload. In order to facilitate this, we add some
helper functions that take a specification argument indicating the
type of packets a device is able to offload. If a packet does not match
the specification, the helper function calls skb_checksum_help.

Benefits of this approach are:
  - Simplify the stack and clarify the interface for checksum offload
  - Encourage NIC vendors to implement the generic, protocol agnostic
    checksum offload methods in hardware
  - Encourage feature parity in NIC offloads for IPv4 and IPv6

Many drivers advertise NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM and it
probably isn't feasible to convert them all in a given time frame
(although if we could this would be a great simplification to the
stack). A reasonable direction may be to declare that new drivers must
use NETIF_F_HW_CSUM as NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM are
considered deprecated.

There is a class of drivers that should now be converted to advertise
NETIF_F_HW_CSUM, namely those that support offload of encapsulated
checksums. These drivers have to date been using skb->encapsulation
to infer that checksum offload is being performed for an encapsulated
checksum. This is strictly not correct. skb->encapsulation
indicates that the inner headers are valid in the skbuff, whereas
the stack indicates checksum offload arguments exclusively in csum_start
and csum_offset. At some point we may want to set the inner headers for
an skbuff but offload the outer transport checksum, so this needs to be
fixed.

In this patch set:

  - Rename some of the constants involved in checksum offload to be more
    reflective of their function
  - Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM entirely as
    unnecessary convolutions
  - Fix conditions in tcp_sendpage and tcp_sendmsg to take IP protocol
    into account when determining if checksum offload can be done
  - Add driver helper functions for determining if a checksum can
    be offloaded to a device. If not, the helper function can call
    skb_checksum_help
  - Document the checksum offload interface between the stack and
    drivers with detail and specifics

Testing:

Have been testing ixgbe and mlx4. No noticeable regressions seen yet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>


net: Elaborate on checksum offload interface description · 7a6ae71b

Tom Herbert authored Dec 14, 2015

Add specifics and details to the description of the interface between
the stack and drivers for doing checksum offload. This description
is meant to be as specific and complete as possible.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


net: Add driver helper functions to determine checksum offloadability · 6ae23ad3

Tom Herbert authored Dec 14, 2015

Add skb_csum_offload_chk driver helper function to determine if a
device with limited checksum offload capabilities is able to offload the
checksum for a given packet.

This patch includes:
  - The skb_csum_offload_chk function. Returns true if checksum is
    offloadable, else false. Optionally, in the case that the checksum
    is not offloadable, the function can call skb_checksum_help to resolve
    the checksum. skb_csum_offload_chk also returns whether the checksum
    refers to an encapsulated checksum.
  - Definition of skb_csum_offl_spec structure that caller uses to
    indicate rules about what it can offload (e.g. IPv4/v6, TCP/UDP only,
    whether encapsulated checksums can be offloaded, whether checksum with
    IPv6 extension headers can be offloaded).
  - Ancillary functions called skb_csum_offload_chk_help,
    skb_csum_off_chk_help_cmn, skb_csum_off_chk_help_cmn_v4_only.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
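
The behaviour listed above (check a packet against a driver-supplied
specification, report whether the checksum is encapsulated, optionally
fall back to software) can be sketched as a userspace model. All types,
field names and functions below are hypothetical; they are not the
fields of skb_csum_offl_spec or the signature of skb_csum_offload_chk.

```c
#include <stdbool.h>

enum model_l3 { MODEL_IPV4, MODEL_IPV6 };
enum model_l4 { MODEL_TCP, MODEL_UDP, MODEL_L4_OTHER };

/* Hypothetical packet summary; a real driver would inspect the skb. */
struct model_pkt {
	enum model_l3 l3;
	enum model_l4 l4;
	bool encapsulated;	/* checksum refers to an inner header */
	bool ipv6_ext_hdrs;
	bool csum_resolved_in_sw;
};

/* Hypothetical description of what the device can offload. */
struct model_offl_spec {
	bool ipv4_ok, ipv6_ok;
	bool tcp_ok, udp_ok;
	bool encap_ok;
	bool ext_hdrs_ok;
};

/* Stand-in for skb_checksum_help: resolve the checksum on the CPU. */
void model_checksum_help(struct model_pkt *pkt)
{
	pkt->csum_resolved_in_sw = true;
}

/*
 * Returns true if the device can offload this packet's checksum. When it
 * cannot and csum_help is set, the checksum is resolved in software.
 */
bool model_csum_offload_chk(struct model_pkt *pkt,
			    const struct model_offl_spec *spec,
			    bool *csum_encapped, bool csum_help)
{
	bool ok = true;

	if (pkt->l3 == MODEL_IPV4 && !spec->ipv4_ok)
		ok = false;
	if (pkt->l3 == MODEL_IPV6 && !spec->ipv6_ok)
		ok = false;
	if (pkt->l3 == MODEL_IPV6 && pkt->ipv6_ext_hdrs && !spec->ext_hdrs_ok)
		ok = false;
	if (pkt->l4 == MODEL_TCP && !spec->tcp_ok)
		ok = false;
	if (pkt->l4 == MODEL_UDP && !spec->udp_ok)
		ok = false;
	if (pkt->l4 == MODEL_L4_OTHER)
		ok = false;
	if (pkt->encapsulated && !spec->encap_ok)
		ok = false;

	*csum_encapped = pkt->encapsulated;
	if (!ok && csum_help)
		model_checksum_help(pkt);
	return ok;
}
```

A device limited to plain IPv4 TCP/UDP would set only ipv4_ok, tcp_ok
and udp_ok. Every other packet then takes the software path, which lets
such a driver advertise the generic feature while still handling only
what its hardware supports.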


tcp: Fix conditions to determine checksum offload · 9a49850d

Tom Herbert authored Dec 14, 2015

In tcp_sendpage and tcp_sendmsg we check the route capabilities to
determine if checksum offload can be performed. This check currently
does not take the IP protocol into account for devices that advertise
only one of NETIF_F_IPV6_CSUM or NETIF_F_IP_CSUM. This patch adds a
function to check capabilities for checksum offload with a socket
called sk_check_csum_caps. This function checks for specific IPv4 or
IPv6 offload support based on the family of the socket.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM · c8cd0989

Tom Herbert authored Dec 14, 2015

These netif flags are unnecessary convolutions. It is more
straightforward to just use NETIF_F_HW_CSUM, NETIF_F_IP_CSUM,
and NETIF_F_IPV6_CSUM directly.

This patch also:
    - Cleans up can_checksum_protocol
    - Simplifies netdev_intersect_features
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK · a188222b

Tom Herbert authored Dec 14, 2015

The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
set of features for offloading all checksums. This is a mask of the
checksum offload related features bits. It is incorrect to set both
NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM at the same time for
features of a device.

This patch:
  - Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
    NETIF_F_ALL_CSUM is being used as a mask).
  - Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
    use NETIF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


fcoe: Use CHECKSUM_PARTIAL to indicate CRC offload · 253aab05

Tom Herbert authored Dec 14, 2015

When setting up CRC offload set ip_summed to CHECKSUM_PARTIAL
instead of CHECKSUM_UNNECESSARY. This is consistent with the
definition of CHECKSUM_PARTIAL.

The only driver that seems to be advertising NETIF_F_FCOE_CRC is
ixgbe. AFAICT the driver does not look at ip_summed for FCOE and
just assumes that CRC is being offloaded.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC · 53692b1d

Tom Herbert authored Dec 14, 2015

The SCTP checksum is really a CRC and is very different from the
standard 1's complement checksum that serves as the checksum
for IP protocols. This offload interface is also very different.
Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC to highlight these
differences. The term CSUM should be reserved in the stack to refer
to the standard 1's complement IP checksum.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
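
For contrast with a CRC, the standard 1's complement IP checksum that
the term CSUM refers to can be computed as in RFC 1071. The sketch below
is a plain userspace version for illustration; the function name
inet_checksum is an assumption and this is not the kernel's optimized
implementation.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * 1's complement checksum over a byte buffer, treating the data as
 * big-endian 16-bit words. The 32-bit running sum is adequate for
 * packet-sized buffers.
 */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;	/* pad odd byte with zero */

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

For the RFC 1071 example bytes 00 01 f2 03 f4 f5 f6 f7 the folded sum is
0xddf2, so the checksum is 0x220d. Appending that checksum to the data
and recomputing yields 0, which is how a receiver verifies it. A CRC has
no such additive structure, hence the separate name for the SCTP offload.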


net: Add skb_inner_transport_offset function · 55dc5a9f

Tom Herbert authored Dec 14, 2015

Same thing as skb_transport_offset but returns the offset of the inner
transport header (when skb->encapsulation is set).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


ravb: Add fixed-link support · b4bc88a8

Kazuya Mizuguchi authored Dec 15, 2015

This patch adds support of the fixed PHY.
This patch is based on commit 87009814 ("ucc_geth: use the new fixed
PHY helpers").
Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
Signed-off-by: Yoshihiro Kaneko <ykaneko0929@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


Merge branch 'mlxsw-bridge-vlan-offloading' · a7159a3f

David S. Miller authored Dec 15, 2015

Ido Schimmel says:

====================
This patchset introduces support for the offloading of 802.1D bridges
between VLAN devices. These can either be VLAN devices configured on top
of the physical ports or on top of LAG devices.

Patches 1-2 deal with the necessary infrastructure changes needed in order
to enable the above. The main change is that switchdev drivers can now know
the device from which the switchdev op originated.

Patches 3-10 lay the groundwork for 802.1D bridges support in the mlxsw
driver, with patch 4 doing most of the heavy lifting.

Patch 11 finally offloads these bridges to hardware by listening to the
notifications sent when the VLAN device joins or leaves a bridge. It is
very similar to the already existing 802.1Q bridge we support.

Patches 12-14 add minor modifications to allow one to bridge a VLAN device
configured on top of LAG.
====================
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Add support for VLAN devices on top of LAG · 272c4470

Ido Schimmel authored Dec 15, 2015

When creating a VLAN device on top of LAG, we are basically creating a
vPort on top of each of the port netdevs member in the LAG. Therefore,
these vPorts should inherit both the LAG status and LAG ID from the
underlying port netdevs.

In addition, when the VLAN device joins or leaves a bridge each of the
underlying vPorts should know about it and act accordingly. This is
achieved by propagating the VLAN event down to the lower devices.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Enable FDB records for VLAN devices on top of LAG · 64771e31

Ido Schimmel authored Dec 15, 2015

When adding or removing FDB records of VLAN devices on top of LAG we
should set the lag_vid parameter to the VLAN ID of the VLAN device. It
is reserved otherwise.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: reg: Add lag_vid field to SFD register · afd7f979

Ido Schimmel authored Dec 15, 2015

Unicast LAG records in the Switch Filtering Database (SFD) register have
a lag_vid field indicating the VLAN ID in case of vFIDs. This field is
no longer reserved since we are going to add support for VLAN devices on
top of LAG.

Add the lag_vid field to be used by VLAN devices on top of LAG.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Add support for VLAN devices bridging · 26f0e7fb

Ido Schimmel authored Dec 15, 2015

All the member VLAN devices in a bridge need to share the same vFID.

To achieve that, expand the vFID struct to include the associated bridge
device (or lack thereof) and allow one to look up a vFID based on a bridge
device.

When joining a bridge, look up the relevant vFID or create one if none
exists. Next, make the VLAN device use the vFID.

Leaving a bridge can either occur because a user removed the VLAN device
from a bridge or because the VLAN device was deleted by the user. In the
latter case the bridge's teardown sequence is invoked after the hardware
vPort is already gone. Therefore, when unlinking the VLAN device from
the real device, check if the associated vPort is bridged and act
accordingly. The bridge's notification will be ignored in this case.

Note that bridging a VLAN interface with an ordinary port netdev is
currently not supported, but not forbidden. This will be addressed in a
follow-up patchset.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Handle VLAN devices linking / unlinking · 9589a7b5

Ido Schimmel authored Dec 15, 2015

When a VLAN interface is configured on top of a physical port we should
associate the VLAN device with the matching vPort. Likewise, when it's
removed, we should revert back to the underlying port netdev.

While not a must, this is consistent with port netdevs and also provides
a more accurate error printing via netdev_err() and friends.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


mlxsw: spectrum: Adjust FDB notifications for VLAN devices · aac78a44

Ido Schimmel authored Dec 15, 2015

FDB notifications contain the FID and port (or LAG ID) on which the MAC
was learned. In the case of the 802.1Q bridge one can easily derive the
matching VID - as FID equals VID - and generate the appropriate
notification for the software bridge. With VLAN devices this is no
longer the case, as these are associated with a vFID.

Solve that by converting the FID to a vFID and looking up the matching
VLAN device. From that derive the VID and whether learning (and learning
sync) should occur.
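
A rough sketch of that derivation, under stated assumptions: vFIDs are
assumed to be numbered directly after the 4096 802.1Q FIDs (VFID_BASE
below), and the per-port table vport_vid[][] is a hypothetical stand-in
for looking up the VLAN device on the port that learned the MAC. None of
these names are taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VLAN_N_VID	4096		/* number of 802.1Q VLAN IDs */
#define VFID_BASE	VLAN_N_VID	/* assumption: vFIDs follow the FIDs */
#define NR_PORTS	4		/* hypothetical, small for the example */
#define NR_VFIDS	8		/* hypothetical, small for the example */

/* Hypothetical table: VID of the VLAN device on a port using a vFID,
 * or 0 if the port has no VLAN device mapped to that vFID.
 */
static uint16_t vport_vid[NR_PORTS][NR_VFIDS];

static bool fid_is_vfid(uint16_t fid)
{
	return fid >= VFID_BASE;
}

static uint16_t fid_to_vfid(uint16_t fid)
{
	assert(fid_is_vfid(fid));
	return fid - VFID_BASE;
}

/* Return the VID to report to the software bridge, or 0 if no matching
 * VLAN device is found.
 */
static uint16_t fdb_notif_vid(unsigned int port, uint16_t fid)
{
	uint16_t vfid;

	if (!fid_is_vfid(fid))
		return fid;	/* 802.1Q bridge: FID equals VID */

	vfid = fid_to_vfid(fid);
	if (port >= NR_PORTS || vfid >= NR_VFIDS)
		return 0;
	return vport_vid[port][vfid];
}
```

The port is part of the lookup because VLAN devices sharing a vFID can
carry different VIDs on different ports.
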
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aac78a44

mlxsw: spectrum: Adjust switchdev ops for VLAN devices · 54a73201

Ido Schimmel authored Dec 15, 2015

switchdev ops can now be called for VLAN devices and we need to be
prepared for it. Until now they were only called for the port netdev.

Use the newly propagated orig_dev passed as part of the switchdev
attr/obj and determine whether the original device is a VLAN device. If
so, act accordingly, otherwise continue as usual.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

54a73201

mlxsw: spectrum: Use FID instead of VID when accessing FDB · 9de6a80e

Ido Schimmel authored Dec 15, 2015

In the Spectrum ASIC - unlike SwitchX-2 - FDB access is done by
specifying FID as parameter and not VID.

Change the relevant variable and parameter names to reflect that.

Note that this was OK up until now, since FID was always equal to VID,
but with the introduction of VLAN interfaces this is no longer the case.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9de6a80e

mlxsw: spectrum: Add another flood table for vFIDs · 19ae6124

Ido Schimmel authored Dec 15, 2015

We previously used only one flood table for packets classified to vFIDs.
However, since we are going to add support for bridges between VLAN
interfaces (mapped to vFIDs) we need to add one more flood table.

That way we can separate the flooding domain of unknown unicast traffic
from all the rest and support flood control (as we do with the 802.1Q
bridge).
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19ae6124

mlxsw: spectrum: Use appropriate parameter name · c06a94ef

Ido Schimmel authored Dec 15, 2015

The __mlxsw_sp_port_flood_set function is now used to configure flooding
for both FIDs and vFIDs, so change the parameter name to 'idx' instead
of 'fid'. This is also consistent with hardware documentation.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c06a94ef

mlxsw: spectrum: Split vFID range in two · 7f71eb46

Ido Schimmel authored Dec 15, 2015

Up until now we used a 1:1 mapping - based on VID - to map a VLAN
interface to a vFID. However, a different scheme is needed in order to
support bridges between VLAN interfaces, as all the member interfaces -
which can have different VIDs - need to share the same vFID.

Solve that by splitting the vFID range in two:
 1. Non-bridged VLAN interfaces
 2. Bridged VLAN interfaces

When a VLAN interface is created, assign it the next available vFID in
the first range, unless one already exists for that VID or the number of
vFIDs in the range was exceeded. When an interface is removed, free the
vFID, unless other interfaces are mapped to it.
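
The allocation rule above can be illustrated with a small reference
counted table. This is only a sketch: VFID_RANGE_SIZE, struct
vfid_entry, vfid_get() and vfid_put() are hypothetical names, and the
range is kept tiny so that exhaustion is easy to see.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VFID_RANGE_SIZE	4	/* hypothetical size of the first range */

struct vfid_entry {
	bool in_use;
	uint16_t vid;
	unsigned int nr_vports;	/* interfaces currently mapped to it */
};

static struct vfid_entry vfids[VFID_RANGE_SIZE];

/* Map a VID to a vFID index. Reuse the entry if the VID already has
 * one, otherwise take the lowest free entry. Return -1 if the range is
 * exhausted.
 */
static int vfid_get(uint16_t vid)
{
	int i, free_idx = -1;

	for (i = 0; i < VFID_RANGE_SIZE; i++) {
		if (vfids[i].in_use && vfids[i].vid == vid) {
			vfids[i].nr_vports++;
			return i;
		}
		if (!vfids[i].in_use && free_idx < 0)
			free_idx = i;
	}
	if (free_idx < 0)
		return -1;

	vfids[free_idx] = (struct vfid_entry) {
		.in_use = true,
		.vid = vid,
		.nr_vports = 1,
	};
	return free_idx;
}

/* Drop one reference and free the vFID once nothing is mapped to it. */
static void vfid_put(int idx)
{
	assert(idx >= 0 && idx < VFID_RANGE_SIZE && vfids[idx].in_use);

	if (--vfids[idx].nr_vports == 0)
		vfids[idx].in_use = false;
}
```

Two interfaces with the same VID share one entry, and the entry only
returns to the pool when the second of them is removed.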

To accomplish the above:
 1. Store the VID to vFID mapping in a new struct (mlxsw_sp_vfid), which
    has a global context and holds a reference count.
 2. Create a vPort (dummy in case of bridge SELF invocation) on top of
    the physical port and hold a reference to the associated vFID.

	     vfid                    vfid
	+-------------+	        +-------------+
	| vfid        |         | vfid        |
	| vid         +---> ... | vid         |
	| nr_vports   |         | nr_vports   |
	+------+------+         +------+------+
				       |
	       +-----------------------+-------+
	       |			       |
	     vport			     vport
	+-------------+         	+-------------+
	| ...	      |         	| ...	      |
	| *vfid	      +---> ... 	| *vfid	      +---> ...
	| ...	      |         	| ...	      |
	+------+------+         	+------+------+
	       |                               |
	     port			     port
	+-------------+         	+-------------+
	| ...         |         	| ...         |
	| vports_list |         	| vports_list |
	| ...         |         	| ...         |
	+-------------+         	+-------------+
	     swXpY			     swXpZ

Next patches in the series will add the missing infrastructure for the
second range and transfer vPorts between the two ranges according to the
received notifications.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7f71eb46

mlxsw: spectrum: Allocate active VLANs only for port netdevs · bd40e9d6

Ido Schimmel authored Dec 15, 2015

When adding support for bridges between VLAN interfaces, we'll introduce
a new entity called a vPort, which is a representation of the VLAN
interface in the hardware.

The main difference between a vPort and a physical port is that several
FIDs can be bound to the latter, whereas only one (called a vFID) can be
bound to the former.

Therefore, it makes sense to use the same struct to represent the two,
but to only allocate the 'active_vlans' bitmap in case of a physical
port.
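
As an illustration of sharing one struct while allocating the bitmap
conditionally, consider the userspace sketch below. struct sw_port and
its helpers are hypothetical names; only the 'active_vlans' field name
comes from this message.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define VLAN_N_VID		4096
#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n)	(((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical struct shared by physical ports and vPorts. */
struct sw_port {
	bool is_vport;
	unsigned long *active_vlans;	/* NULL for a vPort */
	unsigned short vfid;		/* only meaningful for a vPort */
};

static struct sw_port *sw_port_create(bool is_vport)
{
	struct sw_port *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;
	p->is_vport = is_vport;

	/* Only a physical port can have several FIDs bound to it, so
	 * only a physical port needs the per-VID bitmap.
	 */
	if (!is_vport) {
		p->active_vlans = calloc(BITS_TO_LONGS(VLAN_N_VID),
					 sizeof(unsigned long));
		if (!p->active_vlans) {
			free(p);
			return NULL;
		}
	}
	return p;
}

/* Mark a VID as active on a physical port. */
static void sw_port_vlan_set(struct sw_port *p, unsigned int vid)
{
	assert(!p->is_vport && vid < VLAN_N_VID);
	p->active_vlans[vid / BITS_PER_LONG] |= 1UL << (vid % BITS_PER_LONG);
}

static void sw_port_destroy(struct sw_port *p)
{
	if (!p)
		return;
	free(p->active_vlans);	/* free(NULL) is a no-op for vPorts */
	free(p);
}
```
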
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bd40e9d6

switchdev: Pass original device to port netdev driver · 6ff64f6f

Ido Schimmel authored Dec 15, 2015

switchdev drivers need to know the netdev on which the switchdev op was
invoked. For example, the STP state of a VLAN interface configured on top
of a port can change while it is a member of a bridge. In this case, the
underlying driver should only change the STP state of that particular
VLAN and not of all the VLANs configured on the port.

However, current switchdev infrastructure only passes the port netdev down
to the driver. Solve that by passing the original device down to the
driver as part of the required switchdev object / attribute.

This doesn't entail any change in current switchdev drivers. It simply
enables those supporting stacked devices to know the originating device
and act accordingly.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6ff64f6f