1. 16 Dec, 2015 23 commits
    • Merge branch 'geneve-udp-port-offload' · 897ca373
      David S. Miller authored
      Anjali Singhai Jain says:
      
      ====================
      Add support for Geneve udp port offload
      
      This patch series adds new ndo ops for Geneve add/del port, so as
      to help offload Geneve tunnel functionalities such as RX checksum,
      RSS, filters etc.
      
      i40e driver has been tested with the changes to make sure the offloads
      happen.
      
      We do understand that this is not the ideal solution and most likely
      will be redone with a more generic offload framework.
      But this certainly will enable us to start seeing benefits of the
      accelerations for Geneve tunnels.
      
      As a side note, we did find an existing issue in i40e driver where a
      service task can modify tunnel data structures with no locks held to
      help linearize access. A separate patch will be taking care of that issue.
      
      A question for the community regarding the driver Kconfig parameters
      for VxLAN and Geneve: it would be ideal to drop those if there is a way
      to resolve the vxlan/geneve_get_rx_port symbols while the tunnel modules
      are not loaded.
      
      Performance numbers:
      With the offloads enabled on X722 devices, with remote checksum enabled
      and no other tuning (CPU governor etc.) on my test machine:
      
      With offload
      Throughput: 5527Mbits/sec with a single thread
      %cpu: ~43% per core with 4 threads
      
      Without offload
      Throughput: 2364Mbits/sec with a single thread
      %cpu: ~99% per core with 4 threads
      
      These numbers will get better for X722 as it is being worked on. But
      this does bring out the delta between the stack being notified with
      csum_level 1 and CHECKSUM_UNNECESSARY and the case without the RX offload.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e: Call geneve_get_rx_port to get the existing Geneve ports · cd866606
      Singhai, Anjali authored
      This patch adds a call to geneve_get_rx_port in i40e so that when it
      comes up it can learn about the existing geneve tunnels.
      Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • geneve: Add geneve_get_rx_port support · 05ca4029
      Singhai, Anjali authored
      This patch adds an op that the drivers can call into to get existing
      geneve ports.
      Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e: Kernel dependency update for i40e to support geneve offload · c110c311
      Singhai, Anjali authored
      Update the Kconfig file with dependency for supporting GENEVE tunnel
      offloads.
      Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
      Signed-off-by: Kiran Patil <kiran.patil@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e: geneve tunnel offload support · 6a899024
      Singhai, Anjali authored
      This patch adds driver hooks to implement ndo_ops to add/del
      udp port in the HW to identify GENEVE tunnels.
      Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
      Signed-off-by: Kiran Patil <kiran.patil@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • geneve: Add geneve udp port offload for ethernet devices · a8170d2b
      Singhai, Anjali authored
      Add ndo_ops to add/del UDP ports to a device that supports geneve
      offload.
      
      v2: Comment fix.
      Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
      Signed-off-by: Kiran Patil <kiran.patil@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sctp: dynamically enable or disable pf state · 566178f8
      Zhu Yanjun authored
      As we all know, the value of pf_retrans >= max_retrans_path can
      disable pf state. The variables of pf_retrans and max_retrans_path
      can be changed by the userspace application.
      
      Sometimes the user expects to disable pf state while the 2
      variables are changed to enable pf state. So it is necessary to
      introduce a new variable to disable pf state.
      
      According to the suggestions from Vlad Yasevich, extra1 and extra2
      are removed. The initialization of pf_enable is added.
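
The interaction between the new pf_enable switch and the two existing
thresholds can be sketched as below. This is a minimal user-space model, not
the kernel code: the PathState class, the may_enter_pf function and the
"error_count > pf_retrans" trigger are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PathState:
    """Hypothetical container for the three tunables discussed above."""
    pf_enable: bool        # the new explicit switch
    pf_retrans: int        # errors tolerated before a path may become PF
    path_max_retrans: int  # errors tolerated before a path is marked inactive


def may_enter_pf(state: PathState, error_count: int) -> bool:
    """Return True if a path with error_count failures may enter the pf state."""
    if not state.pf_enable:
        # The explicit switch wins regardless of the two thresholds.
        return False
    if state.pf_retrans >= state.path_max_retrans:
        # The pre-existing, implicit way of disabling the pf state.
        return False
    # Assumed trigger: more errors than pf_retrans allows.
    return error_count > state.pf_retrans
```

With this model, a user who sets pf_enable to False keeps the pf state off
even if another application later changes the thresholds to values that would
otherwise enable it.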
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: use GFP_KERNEL in sctp_init() · 6857a02a
      Eric Dumazet authored
      Since module init functions are called from process context, we had
      better use GFP_KERNEL allocations to increase our chances of getting
      the high-order pages we want for SCTP hash tables.
      
      This mostly matters if SCTP module is loaded once memory got fragmented.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sock-diag-destroy' · 5cfe6d8a
      David S. Miller authored
      Lorenzo Colitti says:
      
      ====================
      Support administratively closing application sockets
      
      This patchset adds the ability to administratively close a socket
      without any action from the process owning the socket or the
      socket protocol.
      
      It implements this by adding a new diag_destroy function pointer
      to struct proto. In-kernel callers can access this functionality
      directly by calling sk->sk_prot->diag_destroy(sk, err).
      
      It also exposes this functionality to userspace via a new
      SOCK_DESTROY operation in the NETLINK_SOCK_DIAG sockets. This
      allows a privileged userspace process, such as a connection
      manager or system administration tool, to close sockets belonging
      to other apps when the network they were established on has
      disconnected. It is needed on laptops and mobile hosts to ensure
      that network switches / disconnects do not result in applications
      being blocked for long periods of time (minutes) in read or
      connect calls on TCP sockets that will never succeed because the
      IP address they are bound to is no longer on the system. Closing
      the sockets causes these calls to fail fast and allows the apps
      to reconnect on another network.
      
      Userspace intervention is necessary because in many cases the
      kernel does not have enough information to know that a connection
      is now inoperable. The kernel can know if a packet can't be
      routed, but in general it won't know if a TCP connection is stuck
      because it is now routed to a network where its source address is
      no longer valid [5][6].
      
      Many other operating systems offer similar functionality:
      
       - FreeBSD has had this since 5.4 in 2005 [2]. It is available
         to privileged userspace and there is a tool to use it [3].
       - The FreeBSD commit description states that the idea came
         from OpenBSD.
       - iOS has been administratively closing app sockets since
         iOS 4 - see [4], which states that a socket "might get
         reclaimed by the kernel" and after that will return EBADF.
       - For many years Android kernels have supported an out-of-tree
         SIOCKILLADDR ioctl that is called when a network disconnects
         or an RTM_DELADDR event is received. This solution is cleaner,
         more robust and more flexible. The connection manager can
         implement SIOCKILLADDR by iterating over all connections on
         the deleted IP address and closing all of them, but it can also
         close all sockets opened by a given app process (for example
         if the user has restricted that app from using the network),
         or close all of an application's connections if a secure
         network such as a VPN has connected and security policy
         requires them to be routed via the VPN, etc.
      
      Alternative schemes, such as TCP keepalives in combination with
      "iptables -j REJECT --reject-with tcp-reset", could be used to
      achieve similar results, but on mobile devices TCP keepalives are
      very expensive, and in such a scheme detecting stuck connections
      has to wait for a keepalive to be sent or the application to
      perform a write. An explicit notification from userspace is
      cheaper and faster in the common case where an application is
      blocked on read.
      
      SOCK_DESTROY is placed behind an INET_DIAG_DESTROY configuration
      option, which is currently off by default.
      
      The TCP implementation of diag_destroy causes a TCP ABORT as
      specified by RFC 793 [1]: immediately send a RST and clear local
      connection state. This is what happens today if an application
      enables SO_LINGER with a timeout of 0 and then calls close.
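
The existing application-level equivalent mentioned above (SO_LINGER with a
timeout of 0 followed by close) can be tried from userspace. The sketch below
is an illustration over loopback, not part of the patchset; the function names
abortive_close and demo_abortive_close are made up for this example.

```python
import socket
import struct


def abortive_close(sock: socket.socket) -> None:
    """Close sock with SO_LINGER enabled and a linger time of 0 seconds."""
    # struct linger { int l_onoff; int l_linger; }
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


def demo_abortive_close() -> str:
    """Report what the peer observes after an abortive close on loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    peer, _ = listener.accept()
    peer.settimeout(5)
    abortive_close(client)
    try:
        data = peer.recv(1)
        return "eof" if data == b"" else "data"
    except ConnectionResetError:
        # The peer saw a RST rather than an orderly FIN.
        return "reset"
    finally:
        peer.close()
        listener.close()
```

On Linux the peer is expected to see a connection reset rather than an orderly
end-of-file, which is the same on-the-wire behaviour the text describes for
the TCP implementation of diag_destroy.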
      
      The first versions of the patchset did not send a RST, but that
      is not graceful/correct TCP behaviour. tcp_abort now does a
      proper RFC 793 ABORT and sends a RST to the peer. This is
      consistent with BSD's tcpdrop, and is more correct in general,
      even though in many use cases tcp_abort will only be called when
      sending a RST is no longer possible (e.g., the network has
      disconnected).
      
      The original patchset also behaved like SIOCKILLADDR and closed
      TCP sockets with ETIMEDOUT. Tom Herbert pointed out that it would
      be better if applications could distinguish between a timeout and
      an administrative close. ECONNABORTED was chosen because it is
      consistent with BSD.
      
      [1] http://tools.ietf.org/html/rfc793#page-50
      [2] http://svnweb.freebsd.org/base?view=revision&revision=141381
      [3] https://www.freebsd.org/cgi/man.cgi?query=tcpdrop&sektion=8&manpath=FreeBSD+5.4-RELEASE
      [4] https://developer.apple.com/library/ios/technotes/tn2277/_index.html#//apple_ref/doc/uid/DTS40010841-CH1-SUBSECTION3
      [5] http://www.spinics.net/lists/netdev/msg352775.html
      [6] http://www.spinics.net/lists/netdev/msg352952.html
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: diag: Support destroying TCP sockets. · c1e64e29
      Lorenzo Colitti authored
      This implements SOCK_DESTROY for TCP sockets. It causes all
      blocking calls on the socket to fail fast with ECONNABORTED and
      causes a protocol close of the socket. It informs the other end
      of the connection by sending a RST, i.e., initiating a TCP ABORT
      as per RFC 793. ECONNABORTED was chosen for consistency with
      FreeBSD.
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: diag: Support SOCK_DESTROY for inet sockets. · 6eb5d2e0
      Lorenzo Colitti authored
      This passes the SOCK_DESTROY operation to the underlying protocol
      diag handler, or returns -EOPNOTSUPP if that handler does not
      define a destroy operation.
      
      Most of this patch is just renaming functions. This is not
      strictly necessary, but it would be fairly counterintuitive to
      have the code to destroy inet sockets be in a function whose name
      starts with inet_diag_get.
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: diag: Add the ability to destroy a socket. · 64be0aed
      Lorenzo Colitti authored
      This patch adds a SOCK_DESTROY operation, a destroy function
      pointer to sock_diag_handler, and a diag_destroy function
      pointer.  It does not include any implementation code.
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: diag: split inet_diag_dump_one_icsk into two · b613f56e
      Lorenzo Colitti authored
      Currently, inet_diag_dump_one_icsk finds a socket and then dumps
      its information to userspace. Split it into a part that finds the
      socket and a part that dumps the information.
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ila-early-demux' · fec65bd4
      David S. Miller authored
      Tom Herbert says:
      
      ====================
      ila: Optimization to preserve value of early demux
      
      In the current implementation of ILA, LWT is used to perform
      translation on both the input and output paths. This is functional,
      however there is a big performance hit in the receive path. Early
      demux occurs before the routing lookup (a hit actually obviates the
      route lookup). Therefore the stack currently performs early
      demux before translation so that a local connection with ILA
      addresses is never matched. Note that this issue is not just
      with ILA, but pretty much any translated or encapsulated packet
      handled by LWT would miss the opportunity for early demux. Solving
      the general problem seems non-trivial since we would need to move
      the route lookup before early demux, thereby reducing its value.
      
      This patch set addresses the issue for ILA by adding a fast locator
      lookup that occurs before early demux. This is done by hooking into
      NF_INET_PRE_ROUTING.
      
      For the backend we implement an rhashtable that contains identifier
      to locator mappings. The table also allows more specific matches
      that include original locator and interface.
      
      This patch set:
       - Add an rhashtable function to atomically replace an element.
         This is useful to implement sub-trees from a table entry
         without needing to use a special anchor structure as the
         table entry.
       - Add a start callback for starting a netlink dump.
       - Creates an ila directory under net/ipv6 and moves ila.c to it.
         ila.c is split into ila_common.c and ila_lwt.c.
       - Implement a table to do identifier->locator mapping. This is
         an rhashtable (in ila_xlat.c).
       - Configuration for the table with netlink.
       - Add a hook into NF_INET_PRE_ROUTING to perform ILA translation
         before early demux.
      
      Changes in v2:
       - Use iptables targets instead of a new xfrm function
      
      Changes in v3:
       - Add __rcu to next pointer in struct ila_map
      
      Changes in v4:
       - Use hook for NF_INET_PRE_ROUTING
      
      Changed in v5:
       - Register hooks per namespace using nf_register_net_hooks
       - Only register hooks when first mapping is actually added
      
      Changed in v6:
        - Remove gfp argument in alloc_ila_locks, it is unnecessary
        - Set registered_hooks properly when hooks are registered
      
      Testing:
         Running 200 netperf TCP_RR streams
      
      No ILA, baseline
         79.26% CPU utilization
         1678282 tps
         104/189/390 50/90/99% latencies
      
      ILA before fix (LWT on both input and output)
         81.91% CPU utilization
         1464723 tps (-14.5% from baseline)
         121/215/411 50/90/99% latencies
      
      ILA after fix
         80.62% CPU utilization
         1622985 tps (-3.4% from baseline)
         110/191/347 50/90/99% latencies
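
The throughput deltas can be recomputed from the tps figures quoted above.
The sketch below computes the difference two ways: as a fraction of the
baseline, and as a fraction of the measured ILA figure. The function names are
made up for this example; only the three tps values come from the text.

```python
BASELINE_TPS = 1678282
ILA_BEFORE_TPS = 1464723
ILA_AFTER_TPS = 1622985


def pct_of_baseline(tps: int, baseline: int = BASELINE_TPS) -> float:
    """Throughput shortfall expressed as a percentage of the baseline."""
    return (baseline - tps) / baseline * 100.0


def pct_of_measured(tps: int, baseline: int = BASELINE_TPS) -> float:
    """Throughput shortfall expressed as a percentage of the measured value."""
    return (baseline - tps) / tps * 100.0


for label, tps in (("before fix", ILA_BEFORE_TPS), ("after fix", ILA_AFTER_TPS)):
    print(label, round(pct_of_baseline(tps), 2), round(pct_of_measured(tps), 2))
```

The quoted 14.5% and 3.4% figures appear to be closer to the second form (the
shortfall divided by the measured ILA value) than to the first; either way the
fix recovers most of the lost throughput.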
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ila: Add generic ILA translation facility · 7f00feaf
      Tom Herbert authored
      This patch implements an ILA translation table. This table can be
      configured with identifier to locator mappings, and can be queried
      to resolve a mapping. Queries can be parameterized based on interface,
      direction (incoming or outgoing), and matching locator. The table is
      implemented using rhashtable and is configured via netlink (through
      "ip ila .." in iproute).

      The table may be used as an alternative means to do ILA translations
      other than the lw tunnels.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: add a start callback for starting a netlink dump · fc9e50f5
      Tom Herbert authored
      The start callback allows the caller to set up a context for the
      dump callbacks. Presumably, the context can then be destroyed in
      the done callback.
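
The start/dump/done lifecycle described here can be modelled in a few lines.
This is a user-space sketch of the callback ordering only; run_dump, the ctx
dictionary and the events list are hypothetical and do not mirror the kernel's
netlink structures.

```python
from itertools import islice

events = []  # records the order in which the callbacks run


def start(ctx):
    """Set up per-dump context once, before the first dump call."""
    events.append("start")
    ctx["iter"] = iter(range(5))


def dump(ctx):
    """Emit up to two items per call; an empty list means the dump is complete."""
    events.append("dump")
    return list(islice(ctx["iter"], 2))


def done(ctx):
    """Tear down whatever start() set up."""
    events.append("done")
    ctx.clear()


def run_dump(start_cb, dump_cb, done_cb):
    """Call start once, dump until it returns nothing, then done once."""
    ctx = {}
    if start_cb is not None:
        start_cb(ctx)
    chunks = []
    try:
        while True:
            chunk = dump_cb(ctx)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        if done_cb is not None:
            done_cb(ctx)
    return chunks
```

Keeping the setup in start rather than in the first dump call means the dump
callback does not need a "first invocation" special case.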
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rhashtable: add function to replace an element · 3502cad7
      Tom Herbert authored
      Add the rhashtable_replace_fast function. This replaces one object in
      the table with another atomically. The hashes of the new and old objects
      must be equal.
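
The semantics of such a replace operation can be illustrated with a toy
chained hash table. This is a single-threaded sketch that only models the
"same hash, swap in place" rule; it does not reproduce the concurrency
guarantees of rhashtable, and ToyTable with its methods is hypothetical.

```python
class ToyTable:
    """Minimal chained hash table keyed by obj["key"]."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, obj):
        self._bucket(obj["key"]).append(obj)

    def lookup(self, key):
        for obj in self._bucket(key):
            if obj["key"] == key:
                return obj
        return None

    def replace(self, old, new):
        """Swap old for new in its bucket; both must hash to the same value."""
        if hash(old["key"]) != hash(new["key"]):
            raise ValueError("old and new objects must have equal hashes")
        bucket = self._bucket(old["key"])
        for i, obj in enumerate(bucket):
            if obj is old:
                bucket[i] = new  # one slot update, so readers see old or new
                return True
        return False  # old was not in the table
```

Because the new object occupies the slot of the old one, a lookup never
observes a state in which the key is missing, which is what makes replace
preferable to a remove followed by an insert.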
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ila: Create net/ipv6/ila directory · 33f11d16
      Tom Herbert authored
      Create ila directory in preparation for supporting other hooks in the
      kernel than LWT for doing ILA. This includes:
        - Moving ila.c to ila/ila_lwt.c
        - Splitting out some common functions into ila_common.c
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'stmmac-mdio-compat' · 3026043d
      David S. Miller authored
      Phil Reid says:
      
      ====================
      stmmac: create of compatible mdio bus for stmmac driver
      
      Provide ability to specify a fixed phy in the device tree and
      retain the mdio bus if no phy is found. This is needed where
      a dsa is connected via a fixed phy and uses the mdio bus for config.
      Fixed ptp ref clock calculations for the stmmac when ptp ref clock
      is running at <= 50MHz. Also add device tree setting to config
      ptp clk source on socfpga platforms.
      
      Changes from V5:
      - Restore behaviour of unregister mdio bus when no phys found
        if there is no device tree node create the bus.
      - Modify condition to allocate mdio_base_data conditional
        on fixed phy presence as well. Maintains existing behaviour
        in conditions where a fixed phy is not present.
      
      Changes from V4:
      - Restore #ifdef CONFIG_OF around setting of reset_gpio.
        Member doesn't exist when this isn't defined.
      
      Changes from V3:
      - Use if (IS_ENABLED(CONFIG_OF)) instead of #if.
        Reorder some code to reduce if statements.
      - of_mdiobus_register already falls back to mdiobus_register
      - Tested on system with CONFIG_OF
      
      Changes from V2:
      - Formatting, spaces & lines > 80 chars. Using checkpatch
      - Drop PTP register debugfs patch.
      
      Changes from V1:
      - Fixed mismatch doc / code for ptp_ref_clk dt node.
      - Remove unit address from doc example.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • stmmac: socfpga: Provide dt node to config ptp clk source. · 43569814
      Phil Reid authored
      Provides an option to use the ptp clock routed from the Altera FPGA
      fabric instead of the default eosc1 clock connected to the ARM HPS core.
      This setting affects all emacs in the core as the ptp clock is common.
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Phil Reid <preid@electromag.com.au>
      Acked-by: Dinh Nguyen <dinguyen@opensource.altera.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • stmmac: Fix calculations for ptp counters when clock input = 50Mhz. · 19d857c9
      Phil Reid authored
      stmmac_config_sub_second_increment set the sub second increment to 20ns.
      The driver is configured to use the fine adjustment method, where the
      sub second register is incremented when the accumulator, incremented by
      the addend register, overflows. This accumulator is updated on every ptp
      clk cycle. If a ptp clk with a period of greater than 20ns was used, the
      sub second register would not get updated correctly.

      Instead set the sub sec increment to twice the period of the ptp clk.
      This results in the addend register being set mid range and overflowing
      the accumulator every 2 clock cycles.
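
The arithmetic behind "mid range" and "every 2 clock cycles" can be checked
with a small model. This is a sketch under the assumption of a 32-bit
accumulator that must overflow once per sub second increment; ptp_settings and
count_overflows are hypothetical helpers, not the driver's functions.

```python
NSEC_PER_SEC = 1_000_000_000
ACC_MODULUS = 1 << 32  # assumed 32-bit accumulator


def ptp_settings(clk_ptp_rate_hz: int):
    """Return (sub_second_increment_ns, addend) for a given ptp clock rate."""
    period_ns = NSEC_PER_SEC // clk_ptp_rate_hz
    sub_second_inc_ns = 2 * period_ns
    # The accumulator gains `addend` per clock cycle and must overflow
    # (NSEC_PER_SEC / sub_second_inc_ns) times per second.
    overflows_per_sec = NSEC_PER_SEC // sub_second_inc_ns
    addend = (overflows_per_sec << 32) // clk_ptp_rate_hz
    return sub_second_inc_ns, addend


def count_overflows(addend: int, cycles: int) -> int:
    """Count accumulator overflows over a number of ptp clock cycles."""
    acc = 0
    overflows = 0
    for _ in range(cycles):
        acc += addend
        if acc >= ACC_MODULUS:
            acc -= ACC_MODULUS
            overflows += 1
    return overflows
```

For a 50 MHz ptp clock this gives a 40 ns increment and an addend of 2**31,
which is the middle of the 32-bit range and leaves headroom to adjust the
frequency in both directions.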
      Signed-off-by: Phil Reid <preid@electromag.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • stmmac: Correct documentation on stmmac clocks. · bf171f01
      Phil Reid authored
      devm_get_clk looks in clock-name property for matching clock.
      The ptp_ref_clk property is ignored.
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Phil Reid <preid@electromag.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • stmmac: create of compatible mdio bus for stmmac driver · e34d6569
      Phil Reid authored
      The DSA driver needs to be passed a reference to an mdio bus. Typically
      the mac is configured to use a fixed link but the mdio bus still needs
      to be registered so that it can configure the switch.
      This patch follows the same process as the altera tse ethernet driver for
      creation of the mdio bus.
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Phil Reid <preid@electromag.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 15 Dec, 2015 17 commits
    • Merge branch 'end-of-ip-csum' · 93d085d2
      David S. Miller authored
      Tom Herbert says:
      
      ====================
      net: The beginning of the end for NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM
      
      Background:
      
      This patch set starts to address one front in the battle against
      protocol ossification. Protocol ossification describes the state
      that we have arrived at in the evolution of the Internet where we are
      materially limited to only using a very narrow range of protocols
      and protocol features. For instance, only TCP and UDP are sufficiently
      supported on the Internet, so deploying alternative protocols,
      such as SCTP and DCCP, is a non-starter. Similarly, IP options and IPv6
      extension headers are typically not considered feasible for wide
      deployment, so we have lost the extensibility of IP protocols.
      
      Protocol ossification is not only a problem on the Internet, but in
      the data center as well. A root cause of this seems to be narrow,
      protocol specific optimizations implemented in switches (for doing
      ECMP) and in NICs (NIC offloads). These tend to be performance
      optimizations around TCP and UDP packets, and these have become
      requirements to implement performant network solutions at scale.
      
      Attempts to deal with protocol ossification in data center have yielded
      ad hoc, sub-optimal solutions. A main driver of foo-over-UDP (e.g.
      GRE/UDP, MPLS/UDP) is to leverage the existing ECMP and RSS support for
      UDP by setting the source port as an entropy value. This has seen some
      success, but the cost of additional overhead and layering limits its
      usefulness.  An even more extreme solution is STT where non-TCP packets
      are spoofed as TCP to leverage NIC offloads.
      
      This patch set endeavours to address protocol ossification caused by
      techniques used in transmit checksum offload for NICs. Future work
      will address protocol ossification in the other primary NIC offloads--
      namely receive checksum offload, LSO, LRO, and RSS.
      
      NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM:
      
      NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM exemplify the problem of protocol
      ossification. These features are relics from a simpler time in the
      Internet, before encapsulation, before GRE and IPIP. Many hardware
      vendors only saw the need to provide checksum offload for simple UDP and
      TCP packets over IPv4 (IPv6 support is an afterthought also). In today's
      Internet and data centers, checksum offload is well established as a
      valuable feature, but we can no longer afford to be constrained to
      use a handful of protocols and features that are supported at the
      discretion of NIC vendors. Generic and protocol agnostic methods are
      needed.
      
      The actual interface that the stack uses with drivers for checksum
      offload is CHECKSUM_PARTIAL. This is a generic and protocol agnostic
      interface. A driver for a device that supports this generic
      interface advertises NETIF_F_HW_CSUM.
      
      Goals of this patch set:
      
      We propose that drivers advertise NETIF_F_HW_CSUM instead of protocol
      specific values of NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM.  If the
      driver's device is constrained (for instance it can only offload simple
      IPv4 and IPv6 packets) then these constraints can be checked in the
      transmit path and skb_checksum_help would be called for packets that the
      driver is unable to offload. In order to facilitate this, we add some
      helper functions that take a specification argument indicating the
      type of packets a device is able to offload. If a packet does not match
      the specification, the helper function calls skb_checksum_help.
      
      Benefits of this approach are:
        - Simplify the stack and clarify the interface for checksum offload
        - Encourage NIC vendors to implement the generic, protocol agnostic
          checksum offload methods in hardware
        - Encourage feature parity in NIC offloads for IPv4 and IPv6
      
      Many drivers advertise NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM and it
      probably isn't feasible to convert them all in a given time frame
      (although if we could this would be a great simplification to the
      stack). A reasonable direction may be to declare that new drivers must
      use NETIF_F_HW_CSUM as NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM are
      considered deprecated.
      
      There is a class of drivers that should now be converted to advertise
      NETIF_F_HW_CSUM, namely those that support offload of encapsulated
      checksums. These drivers have to date been using skb->encapsulation
      to infer that checksum offload is being performed for an encapsulated
      checksum. This is strictly not correct. skb->encapsulation
      indicates that the inner headers are valid in the skbuff, whereas
      the stack indicates checksum offload arguments exclusively in csum_start
      and csum_offset. At some point we may want to set the inner headers for
      an skbuff but offload the outer transport checksum, so this needs to be
      fixed.
      
      In this patch set:
      
        - Rename some of the constants involved in checksum offload to be more
          reflective of their function
        - Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM entirely as
          unnecessary convolutions
        - Fix conditions in tcp_sendpage and tcp_sendmsg to take IP protocol
          into account when determining if checksum offload can be done
        - Add driver helper functions for determining if a checksum can
          be offloaded to a device. If not, the helper function can call
          skb_checksum_help
        - Document the checksum offload interface between the stack and
          drivers with detail and specifics
      
      Testing:
      
      Have been testing ixgbe and mlx4. No noticeable regressions seen yet.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Elaborate on checksum offload interface description · 7a6ae71b
      Tom Herbert authored
      Add specifics and details to the description of the interface between
      the stack and drivers for doing checksum offload. This description
      is meant to be as specific and complete as possible.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add driver helper functions to determine checksum offloadability · 6ae23ad3
      Tom Herbert authored
      Add skb_csum_offload_chk driver helper function to determine if a
      device with limited checksum offload capabilities is able to offload the
      checksum for a given packet.
      
      This patch includes:
        - The skb_csum_offload_chk function. Returns true if checksum is
          offloadable, else false. Optionally, in the case that the checksum
          is not offloadable, the function can call skb_checksum_help to resolve
          the checksum. skb_csum_offload_chk also returns whether the checksum
          refers to an encapsulated checksum.
        - Definition of skb_csum_offl_spec structure that caller uses to
          indicate rules about what it can offload (e.g. IPv4/v6, TCP/UDP only,
          whether encapsulated checksums can be offloaded, whether checksum with
          IPv6 extension headers can be offloaded).
        - Ancillary functions called skb_csum_offload_chk_help,
          skb_csum_off_chk_help_cmn, skb_csum_off_chk_help_cmn_v4_only.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ae23ad3
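      The decision that skb_csum_offload_chk makes can be sketched in user
      space roughly as follows. This is a simplified model, not the kernel
      code: struct pkt_desc, struct csum_offl_spec, their field names, and
      csum_offload_chk are hypothetical stand-ins chosen for illustration,
      and the software fallback (skb_checksum_help in the kernel) is left
      to the caller.

```c
#include <stdbool.h>

/* Hypothetical, simplified packet description (not struct sk_buff). */
enum l4_proto { L4_TCP, L4_UDP, L4_OTHER };

struct pkt_desc {
    bool ipv6;          /* false means IPv4 */
    bool has_ext_hdrs;  /* IPv6 extension headers present */
    bool encapsulated;  /* checksum refers to an inner (encapsulated) header */
    enum l4_proto l4;
};

/* Hypothetical capability rules a driver would fill in once. */
struct csum_offl_spec {
    bool ipv4_okay;
    bool ipv6_okay;
    bool encap_okay;
    bool ext_hdrs_okay;
    bool tcp_okay;
    bool udp_okay;
};

/*
 * Return true if the device described by spec can offload the checksum
 * of packet p. *csum_encapped reports whether the checksum belongs to an
 * encapsulated header. On false, the caller is expected to compute the
 * checksum in software.
 */
static bool csum_offload_chk(const struct pkt_desc *p,
                             const struct csum_offl_spec *spec,
                             bool *csum_encapped)
{
    *csum_encapped = p->encapsulated;

    if (p->encapsulated && !spec->encap_okay)
        return false;

    if (p->ipv6) {
        if (!spec->ipv6_okay)
            return false;
        if (p->has_ext_hdrs && !spec->ext_hdrs_okay)
            return false;
    } else if (!spec->ipv4_okay) {
        return false;
    }

    switch (p->l4) {
    case L4_TCP:
        return spec->tcp_okay;
    case L4_UDP:
        return spec->udp_okay;
    default:
        return false;
    }
}
```

      A driver limited to IPv4 TCP/UDP would set only ipv4_okay, tcp_okay,
      and udp_okay in its spec and fall back to software for everything
      else.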
    • Tom Herbert's avatar
      tcp: Fix conditions to determine checksum offload · 9a49850d
      Tom Herbert authored
      In tcp_send_sendpage and tcp_sendmsg we check the route capabilities to
      determine if checksum offload can be performed. This check currently
      does not take the IP protocol into account for devices that advertise
      only one of NETIF_F_IPV6_CSUM or NETIF_F_IP_CSUM. This patch adds a
      function to check capabilities for checksum offload with a socket
      called sk_check_csum_caps. This function checks for specific IPv4 or
      IPv6 offload support based on the family of the socket.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a49850d
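      The family-aware check described for sk_check_csum_caps can be
      modeled as below. This is a user-space sketch under assumptions: the
      F_* bit values, struct sock_model, and check_csum_caps are made up
      for illustration and only mirror the roles of NETIF_F_HW_CSUM,
      NETIF_F_IP_CSUM, NETIF_F_IPV6_CSUM, and the socket's route
      capabilities.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; the kernel defines the real NETIF_F_* bits. */
#define F_IP_CSUM   (1u << 0)  /* IPv4 TCP/UDP checksum offload only */
#define F_HW_CSUM   (1u << 1)  /* generic checksum offload */
#define F_IPV6_CSUM (1u << 2)  /* IPv6 TCP/UDP checksum offload only */

enum sock_family { FAMILY_INET, FAMILY_INET6 };

/* Hypothetical stand-in for the fields of struct sock used by the check. */
struct sock_model {
    enum sock_family family;
    uint32_t route_caps;
};

/* True if checksum offload is usable for this socket's address family. */
static bool check_csum_caps(const struct sock_model *sk)
{
    if (sk->route_caps & F_HW_CSUM)
        return true;
    if (sk->family == FAMILY_INET)
        return (sk->route_caps & F_IP_CSUM) != 0;
    return (sk->route_caps & F_IPV6_CSUM) != 0;
}
```

      With this shape, an IPv4 socket routed over a device that advertises
      only the IPv6 flag is no longer treated as offload-capable.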
    • Tom Herbert's avatar
      net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM · c8cd0989
      Tom Herbert authored
      These netif flags are unnecessary convolutions. It is more
      straightforward to just use NETIF_F_HW_CSUM, NETIF_F_IP_CSUM,
      and NETIF_F_IPV6_CSUM directly.
      
      This patch also:
          - Cleans up can_checksum_protocol
          - Simplifies netdev_intersect_features
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c8cd0989
    • Tom Herbert's avatar
      net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK · a188222b
      Tom Herbert authored
      The name NETIF_F_ALL_CSUM is a misnomer. It does not correspond to the
      set of features for offloading all checksums. It is a mask of the
      checksum offload related feature bits. It is incorrect to set both
      NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM at the same
      time for features of a device.
      
      This patch:
        - Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
          NETIF_F_ALL_CSUM is being used as a mask).
        - Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
          use NETIF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a188222b
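      The difference between a mask and a feature set can be illustrated
      with a small user-space sketch. The bit values and helper names
      (F_*, has_csum_offload, csum_features_valid) are assumptions for
      illustration; only the rule they encode comes from the description
      above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; the kernel defines the real NETIF_F_* bits. */
#define F_IP_CSUM   (1u << 0)
#define F_HW_CSUM   (1u << 1)
#define F_IPV6_CSUM (1u << 2)

/* A mask for testing or clearing checksum bits, not a set to advertise. */
#define F_CSUM_MASK (F_IP_CSUM | F_HW_CSUM | F_IPV6_CSUM)

/* Correct use of the mask: does the device offer any checksum offload? */
static bool has_csum_offload(uint32_t features)
{
    return (features & F_CSUM_MASK) != 0;
}

/*
 * A device advertises either the generic offload or the protocol-specific
 * ones, never both kinds together.
 */
static bool csum_features_valid(uint32_t features)
{
    bool generic = (features & F_HW_CSUM) != 0;
    bool specific = (features & (F_IP_CSUM | F_IPV6_CSUM)) != 0;

    return !(generic && specific);
}
```

      Under this model, advertising the whole mask as a feature list fails
      the validity check, which is why the stacked drivers listed above
      switch to the generic flag alone.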
    • Tom Herbert's avatar
      fcoe: Use CHECKSUM_PARTIAL to indicate CRC offload · 253aab05
      Tom Herbert authored
      When setting up CRC offload set ip_summed to CHECKSUM_PARTIAL
      instead of CHECKSUM_UNNECESSARY. This is consistent with the
      definition of CHECKSUM_PARTIAL.
      
      The only driver that seems to be advertising NETIF_F_FCOE_CRC is
      ixgbe. AFAICT the driver does not look at ip_summed for FCOE and
      just assumes that CRC is being offloaded.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      253aab05
    • Tom Herbert's avatar
      sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC · 53692b1d
      Tom Herbert authored
      The SCTP checksum is really a CRC and is very different from the
      standard 1's complement checksum that serves as the checksum
      for IP protocols. This offload interface is also very different.
      Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC to highlight these
      differences. The term CSUM should be reserved in the stack to refer
      to the standard 1's complement IP checksum.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      53692b1d
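      For contrast with a CRC, the standard 1's complement Internet
      checksum (as described in RFC 1071) is just a folded sum of 16-bit
      words. The following is a minimal user-space sketch, not the kernel's
      optimized implementation; the function name inet_checksum is made up
      here, and the 32-bit accumulator assumes packet-sized inputs (well
      under 128 KiB).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * 1's complement checksum over len bytes, treating the data as big-endian
 * 16-bit words. An odd trailing byte is padded with a zero low byte.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

      Because the sum is commutative and associative, it can be computed
      incrementally and adjusted when headers change, which a CRC such as
      the one SCTP uses does not allow in the same way.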
    • Tom Herbert's avatar
      net: Add skb_inner_transport_offset function · 55dc5a9f
      Tom Herbert authored
      Same thing as skb_transport_offset but returns the offset of the inner
      transport header (when skb->encapsulation is set).
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55dc5a9f
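      The relationship between the two offsets can be shown with a
      simplified user-space model. struct skb_model and both functions are
      hypothetical stand-ins; the sketch assumes header positions are
      stored as offsets from the start of the buffer and that the result
      is measured from the current data pointer.

```c
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for struct sk_buff. */
struct skb_model {
    unsigned char *head;              /* start of the buffer */
    unsigned char *data;              /* start of packet data */
    uint16_t transport_header;        /* outer L4 header, offset from head */
    uint16_t inner_transport_header;  /* inner L4 header, offset from head */
};

/* Offset of the outer transport header relative to data. */
static int transport_offset(const struct skb_model *skb)
{
    return (int)((skb->head + skb->transport_header) - skb->data);
}

/* Same idea, but for the inner transport header of an encapsulated packet. */
static int inner_transport_offset(const struct skb_model *skb)
{
    return (int)((skb->head + skb->inner_transport_header) - skb->data);
}
```

      A driver programming an encapsulated checksum offload would use the
      inner offset to tell the hardware where the inner TCP or UDP header
      begins.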
    • Kazuya Mizuguchi's avatar
      ravb: Add fixed-link support · b4bc88a8
      Kazuya Mizuguchi authored
      This patch adds support for the fixed PHY.
      This patch is based on commit 87009814 ("ucc_geth: use the new fixed
      PHY helpers").
      Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
      Signed-off-by: Yoshihiro Kaneko <ykaneko0929@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b4bc88a8
    • David S. Miller's avatar
      Merge branch 'mlxsw-bridge-vlan-offloading' · a7159a3f
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      This patchset introduces support for the offloading of 802.1D bridges
      between VLAN devices. These can either be VLAN devices configured on top
      of the physical ports or on top of LAG devices.
      
      Patches 1-2 deal with the necessary infrastructure changes needed in order
      to enable the above. The main change is that switchdev drivers can now know
      the device from which the switchdev op originated.
      
      Patches 3-10 lay the groundwork for 802.1D bridges support in the mlxsw
      driver, with patch 4 doing most of the heavy lifting.
      
      Patch 11 finally offloads these bridges to hardware by listening to the
      notifications sent when the VLAN device joins or leaves a bridge. It is
      very similar to the already existing 802.1Q bridge we support.
      
      Patches 12-14 add minor modifications to allow one to bridge a VLAN device
      configured on top of LAG.
      ====================
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a7159a3f
    • Ido Schimmel's avatar
      mlxsw: spectrum: Add support for VLAN devices on top of LAG · 272c4470
      Ido Schimmel authored
      When creating a VLAN device on top of LAG, we are basically creating a
      vPort on top of each of the member port netdevs in the LAG. Therefore,
      these vPorts should inherit both the LAG status and LAG ID from the
      underlying port netdevs.
      
      In addition, when the VLAN device joins or leaves a bridge each of the
      underlying vPorts should know about it and act accordingly. This is
      achieved by propagating the VLAN event down to the lower devices.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      272c4470
    • Ido Schimmel's avatar
      mlxsw: spectrum: Enable FDB records for VLAN devices on top of LAG · 64771e31
      Ido Schimmel authored
      When adding or removing FDB records of VLAN devices on top of LAG we
      should set the lag_vid parameter to the VLAN ID of the VLAN device. It
      is reserved otherwise.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      64771e31
    • Ido Schimmel's avatar
      mlxsw: reg: Add lag_vid field to SFD register · afd7f979
      Ido Schimmel authored
      Unicast LAG records in the Switch Filtering Database (SFD) register have
      a lag_vid field indicating the VLAN ID in case of vFIDs. This field is
      no longer reserved since we are going to add support for VLAN devices on
      top of LAG.
      
      Add the lag_vid field to be used by VLAN devices on top of LAG.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      afd7f979
    • Ido Schimmel's avatar
      mlxsw: spectrum: Add support for VLAN devices bridging · 26f0e7fb
      Ido Schimmel authored
      All the member VLAN devices in a bridge need to share the same vFID.
      
      To achieve that, expand the vFID struct to include the associated bridge
      device (or lack thereof) and allow one to look up a vFID based on a
      bridge device.
      
      When joining a bridge, look up the relevant vFID or create one if none
      exists. Next, make the VLAN device use the vFID.
      
      Leaving a bridge can either occur because a user removed the VLAN device
      from a bridge or because the VLAN device was deleted by the user. In the
      latter case the bridge's teardown sequence is invoked after the hardware
      vPort is already gone. Therefore, when unlinking the VLAN device from
      the real device, check if the associated vPort is bridged and act
      accordingly. The bridge's notification will be ignored in this case.
      
      Note that bridging a VLAN interface with an ordinary port netdev is
      currently not supported, but not forbidden. This will be addressed in a
      follow-up patchset.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      26f0e7fb
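      The "look up the vFID for this bridge, or create one" step on bridge
      join can be sketched as below. This is a user-space illustration
      only: the table layout, MAX_VFIDS, and the names vfid_find and
      vfid_get are hypothetical and do not reflect the driver's actual
      data structures.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_VFIDS 8  /* arbitrary size for the sketch */

/* Hypothetical stand-in for a bridge net device. */
struct bridge_dev {
    const char *name;
};

/* A vFID entry records the bridge it serves (NULL when not bridged). */
struct vfid {
    bool in_use;
    const struct bridge_dev *br;
};

struct vfid_table {
    struct vfid entries[MAX_VFIDS];
};

/* Return the index of the vFID associated with br, or -1 if none exists. */
static int vfid_find(const struct vfid_table *t, const struct bridge_dev *br)
{
    for (int i = 0; i < MAX_VFIDS; i++)
        if (t->entries[i].in_use && t->entries[i].br == br)
            return i;
    return -1;
}

/* Look up the vFID for br, creating one in a free slot if needed. */
static int vfid_get(struct vfid_table *t, const struct bridge_dev *br)
{
    int i = vfid_find(t, br);

    if (i >= 0)
        return i;

    for (i = 0; i < MAX_VFIDS; i++) {
        if (!t->entries[i].in_use) {
            t->entries[i].in_use = true;
            t->entries[i].br = br;
            return i;
        }
    }
    return -1;  /* table full */
}
```

      Every VLAN device that joins the same bridge gets the same index
      back, which is what lets all members share one vFID.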
    • Ido Schimmel's avatar
      mlxsw: spectrum: Handle VLAN devices linking / unlinking · 9589a7b5
      Ido Schimmel authored
      When a VLAN interface is configured on top of a physical port we should
      associate the VLAN device with the matching vPort. Likewise, when it's
      removed, we should revert back to the underlying port netdev.
      
      While not a must, this is consistent with port netdevs and also provides
      a more accurate error printing via netdev_err() and friends.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9589a7b5
    • Ido Schimmel's avatar
      mlxsw: spectrum: Adjust FDB notifications for VLAN devices · aac78a44
      Ido Schimmel authored
      FDB notifications contain the FID and port (or LAG ID) on which the MAC
      was learned. In the case of the 802.1Q bridge one can easily derive the
      matching VID - as FID equals VID - and generate the appropriate
      notification for the software bridge. With VLAN devices this is no
      longer the case, as these are associated with a vFID.
      
      Solve that by converting the FID to a vFID and looking up the matching
      VLAN device. From that, derive the VID and whether learning (and
      learning sync) should occur.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aac78a44
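      The VID derivation for an FDB notification can be sketched as
      follows. Everything here is an assumption made for illustration: the
      VFID_BASE split between plain FIDs and vFIDs, the table indexed by
      vFID, and the names struct vlan_dev_model and fdb_notif_resolve. The
      only behavior taken from the description is that FID equals VID in
      the 802.1Q case and that a vFID must be mapped back to its VLAN
      device otherwise.

```c
#include <stdbool.h>
#include <stdint.h>

#define VFID_BASE 4096u  /* assumption: FIDs at or above this are vFIDs */
#define MAX_VFIDS 8u     /* arbitrary size for the sketch */

/* Hypothetical per-vFID record of the VLAN device using it. */
struct vlan_dev_model {
    bool present;
    uint16_t vid;
    bool learning;
};

/*
 * Resolve the VID and learning setting for a notification carrying fid.
 * port_learning is the setting of the port itself, used in the 802.1Q
 * case. Returns false if the FID maps to no known VLAN device.
 */
static bool fdb_notif_resolve(uint16_t fid, bool port_learning,
                              const struct vlan_dev_model *vlan_devs,
                              uint16_t *vid, bool *learning)
{
    if (fid < VFID_BASE) {
        /* 802.1Q bridge: the FID is the VID. */
        *vid = fid;
        *learning = port_learning;
        return true;
    }

    uint16_t vfid = (uint16_t)(fid - VFID_BASE);

    if (vfid >= MAX_VFIDS || !vlan_devs[vfid].present)
        return false;

    *vid = vlan_devs[vfid].vid;
    *learning = vlan_devs[vfid].learning;
    return true;
}
```

      A notification whose vFID has no VLAN device attached is simply
      dropped in this model, since there is no software bridge port to
      report it to.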