1. 05 Oct, 2015 9 commits
    • Merge branch 'ipv4-multipath-hash' · 07355737
      David S. Miller authored
      Peter Nørlund says:
      
      ====================
      ipv4: Hash-based multipath routing
      
      When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed
      from more or less being destination-based into being quasi-random per-packet
      scheduling. This increases the risk of out-of-order packets and makes it
      impossible to use multipath together with anycast services.
      
      This patch series replaces the old implementation with flow-based load
      balancing based on a hash over the source and destination addresses.
      
      Distribution of the hash is done with thresholds as described in RFC 2992.
      This reduces the disruption when a path is added or removed and there are
      more than two paths.
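
The threshold scheme mentioned above can be illustrated with a small user-space sketch. It is not the code from this series: the `nexthop` struct and both function names are hypothetical, and it assumes the sum of all weights fits in 32 bits. It only shows the idea of splitting a 31-bit hash space into weighted regions with inclusive upper bounds, in the spirit of RFC 2992.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical next-hop record: only the fields this sketch needs. */
struct nexthop {
	uint32_t weight;      /* relative share of traffic, must be > 0 */
	uint32_t upper_bound; /* inclusive upper end of this hop's hash region */
};

/*
 * Split the 31-bit hash space [0, 2^31 - 1] into contiguous regions whose
 * sizes are proportional to the weights. Bounds are inclusive, so the last
 * bound is 2^31 - 1 and never overflows a 32-bit value.
 * Assumption: the sum of all weights fits in 32 bits.
 */
static void assign_bounds(struct nexthop *nh, size_t n)
{
	uint64_t total = 0, running = 0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		total += nh[i].weight;
	assert(total > 0);

	for (i = 0; i < n; i++) {
		running += nh[i].weight;
		nh[i].upper_bound = (uint32_t)(((running << 31) / total) - 1);
	}
}

/* Pick the first next hop whose region contains the 31-bit hash. */
static size_t select_nexthop(const struct nexthop *nh, size_t n,
			     uint32_t hash31)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (hash31 <= nh[i].upper_bound)
			return i;
	return n - 1; /* not reached when hash31 < 2^31 */
}
```

Because each flow always produces the same hash, it keeps landing in the same region, and therefore on the same path, for as long as the region boundaries do not move across it.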
      
      To further the chance of successful usage in conjunction with anycast, ICMP
      error packets are hashed over the inner IP addresses. This ensures that PMTU
      will work together with anycast or load-balancers such as IPVS.
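
One way to picture the inner-address hashing described above is the following sketch. The struct, the helper names, and the choice to swap the embedded addresses (so the error is keyed like a packet of the flow it travels alongside) are assumptions made for illustration, not a copy of the patch. The ICMP type numbers are the standard ones; source quench is deliberately excluded, matching the v3 changelog entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard ICMP type numbers used by this sketch. */
enum {
	ICMP_DEST_UNREACH   = 3,
	ICMP_SOURCE_QUENCH  = 4,
	ICMP_REDIRECT       = 5,
	ICMP_TIME_EXCEEDED  = 11,
	ICMP_PARAMETERPROB  = 12,
};

/* Hypothetical pair of IPv4 addresses used as the hash key. */
struct addr_pair {
	uint32_t saddr;
	uint32_t daddr;
};

/* Error types that embed the header of the offending packet. */
static bool icmp_type_is_flow_error(uint8_t type)
{
	return type == ICMP_DEST_UNREACH || type == ICMP_REDIRECT ||
	       type == ICMP_TIME_EXCEEDED || type == ICMP_PARAMETERPROB;
}

/*
 * Choose the addresses to hash for a packet.
 * outer: addresses of the packet being routed.
 * inner: addresses from the IP header embedded in an ICMP error,
 *        or NULL when the packet is not ICMP or carries no such header.
 */
static struct addr_pair multipath_hash_key(const struct addr_pair *outer,
					   uint8_t icmp_type,
					   const struct addr_pair *inner)
{
	struct addr_pair key;

	assert(outer != NULL);
	key = *outer;
	if (inner != NULL && icmp_type_is_flow_error(icmp_type)) {
		/* The error travels opposite to the embedded packet. */
		key.saddr = inner->daddr;
		key.daddr = inner->saddr;
	}
	return key;
}
```

With this key, an ICMP error generated by an intermediate router hashes the same way regardless of which router sent it, since only the embedded addresses are used.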
      
      Port numbers are not considered since fragments could cause problems with
      anycast and IPVS. Relying on the DF-flag for TCP packets is also insufficient,
      since ICMP inspection effectively extracts information from the opposite
      flow which might have a different state of the DF-flag. This is also why the
      RSS hash is not used. These are typically based on the NDIS RSS spec which
      mandates TCP support.
      
      Measurements of the additional overhead of a two-path multipath
      (ip_mkroute_input excl. __mkroute_input) on a Xeon X3550 (4 cores, 2.66GHz):
      
      Original per-packet: ~394 cycles/packet
      L3 hash:              ~76 cycles/packet
      
      Changes in v5:
      - Fixed compilation error
      
      Changes in v4:
      - Functions take hash directly instead of func ptr
      - Added inline hash function
      - Added dummy macros to minimize ifdefs
      - Use upper 31 bits of hash instead of lower
      
      Changes in v3:
      - Multipath algorithm is no longer configurable (always L3)
      - Added random seed to hash
      - Moved ICMP inspection to isolated function
      - Ignore source quench packets (deprecated as per RFC 6633)
      
      Changes in v2:
      - Replaced 8-bit xor hash with 31-bit jenkins hash
      - Don't scale weights (since 31-bit)
      - Avoided unnecessary renaming of variables
      - Rely on DF-bit instead of fragment offset when checking for fragmentation
      - upper_bound is now inclusive to avoid overflow
      - Use a callback to postpone extracting flow information until necessary
      - Skipped ICMP inspection entirely with L4 hashing
      - Handle newly added sysctl ignore_routes_with_linkdown
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: ICMP packet inspection for multipath · 79a13159
      Peter Nørlund authored
      ICMP packets are inspected to let them route together with the flow they
      belong to, minimizing the chance that a problematic path will affect flows
      on other paths, and so that anycast environments can work with ECMP.
      Signed-off-by: Peter Nørlund <pch@ordbogen.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: L3 hash-based multipath · 0e884c78
      Peter Nørlund authored
      Replaces the per-packet multipath with a hash-based multipath using
      source and destination address.
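
A minimal sketch of such an L3 flow hash follows. The mixing function is a generic stand-in chosen for this example, not the Jenkins hash mentioned in the cover letter, and the function names are hypothetical. What it does reflect from the series description is that only the two addresses and a random seed are hashed, and that the upper 31 bits of the result are kept.

```c
#include <assert.h>
#include <stdint.h>

/* Generic 32-bit integer mixer, used here only as a stand-in hash. */
static uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x85ebca6bu;
	x ^= x >> 13;
	x *= 0xc2b2ae35u;
	x ^= x >> 16;
	return x;
}

/*
 * Hash a flow by source and destination address only. Ports are not
 * inputs, so every fragment of a packet maps to the same value.
 * The seed would be chosen randomly once; here it is just a parameter.
 */
static uint32_t l3_flow_hash(uint32_t saddr, uint32_t daddr, uint32_t seed)
{
	uint32_t h = mix32(saddr ^ seed);

	h = mix32(h ^ daddr);
	h >>= 1; /* keep the upper 31 bits */
	assert(h <= 0x7fffffffu);
	return h;
}
```

The 31-bit result can then be compared against per-path thresholds to pick a next hop.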
      Signed-off-by: Peter Nørlund <pch@ordbogen.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-listener-fixes-and-improvement' · 2472186f
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: lockless listener fixes and improvement
      
      This fixes issues with TCP FastOpen vs lockless listeners,
      and SYNACK being attached to request sockets.
      
      Then, last patch brings performance improvement for
      syncookies generation and validation.
      
      Tested under a 4.3 Mpps SYNFLOOD attack, new perf profile looks
      like :
          12.11%  [kernel]  [k] sha_transform
           5.83%  [kernel]  [k] tcp_conn_request
           4.59%  [kernel]  [k] __inet_lookup_listener
           4.11%  [kernel]  [k] ipt_do_table
           3.91%  [kernel]  [k] tcp_make_synack
           3.05%  [kernel]  [k] fib_table_lookup
           2.74%  [kernel]  [k] sock_wfree
           2.66%  [kernel]  [k] memcpy_erms
           2.12%  [kernel]  [k] tcp_v4_rcv
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: avoid two atomic ops for syncookies · a1a5344d
      Eric Dumazet authored
      inet_reqsk_alloc() is used to allocate a temporary request
      in order to generate a SYNACK with a cookie. Then later,
      syncookie validation also uses a temporary request.
      
      These paths already took a reference on listener refcount,
      we can avoid a couple of atomic operations.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: use sk_fullsock() in __netdev_pick_tx() · 004a5d01
      Eric Dumazet authored
      SYN_RECV & TIMEWAIT sockets are not full blown, they do not have a
      sk_dst_cache pointer.
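
The kind of guard this fix relies on can be sketched in user space as below. The `mini_sock` struct and both helpers are hypothetical stand-ins, and the state values are only meant to mirror the kernel's TCP state numbering. In the kernel, request and timewait sockets are smaller structures that simply lack the field, which is why the check has to come before the access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative state numbers, modelled on the kernel's TCP states. */
enum {
	TCP_ESTABLISHED  = 1,
	TCP_TIME_WAIT    = 6,
	TCP_NEW_SYN_RECV = 12,
};

/* Hypothetical socket: dst_cache is only meaningful for full sockets. */
struct mini_sock {
	int state;
	void *dst_cache;
};

/* A socket is "full" unless it is a timewait or request socket. */
static bool is_fullsock(const struct mini_sock *sk)
{
	assert(sk != NULL);
	return sk->state != TCP_TIME_WAIT && sk->state != TCP_NEW_SYN_RECV;
}

/* Only touch the cached route when the socket really has one. */
static void *dst_cache_or_null(const struct mini_sock *sk)
{
	return is_fullsock(sk) ? sk->dst_cache : NULL;
}
```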
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: inet6_sk() should use sk_fullsock() · e7eadb4d
      Eric Dumazet authored
      SYN_RECV & TIMEWAIT sockets are not full blown, they do not have a pinet6
      pointer.
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: ip_skb_dst_mtu() should use sk_fullsock() · caf3f267
      Eric Dumazet authored
      SYN_RECV & TIMEWAIT sockets are not full blown,
      do not even try to call ip_sk_use_pmtu() on them.
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix fastopen races vs lockless listener · 7656d842
      Eric Dumazet authored
      There are multiple races that need fixes :
      
      1) skb_get() + queue skb + kfree_skb() is racy
      
      An accept() can be done on another cpu, data consumed immediately.
      tcp_recvmsg() uses __kfree_skb() as it is assumed all skb found in
      socket receive queue are private.
      
      Then the kfree_skb() in tcp_rcv_state_process() uses an already freed skb
      
      2) tcp_reqsk_record_syn() needs to be done before tcp_try_fastopen()
      for the same reasons.
      
      3) We want to send the SYNACK before queueing child into accept queue,
      otherwise we might reintroduce the ooo issue fixed in
      commit 7c85af88 ("tcp: avoid reorders for TFO passive connections")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 04 Oct, 2015 26 commits
  3. 03 Oct, 2015 5 commits
    • tcp/dccp: add SLAB_DESTROY_BY_RCU flag for request sockets · e96f78ab
      Eric Dumazet authored
      Before letting request sockets be put in the regular TCP/DCCP
      ehash table, we need to either:
      
      - add the SLAB_DESTROY_BY_RCU flag to their kmem_cache, or
      - add an RCU grace period before freeing them.
      
      Since we carefully respected the SLAB_DESTROY_BY_RCU protocol
      like ESTABLISH and TIMEWAIT sockets, use it here.
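
The lookup discipline usually associated with this kind of slab flag can be sketched with C11 atomics as follows. This is a user-space illustration under stated assumptions, not kernel code: the `obj` struct and function names are hypothetical, and real code would also need the RCU read-side section and proper memory ordering. The point is that memory may be recycled for a different object of the same type, so the identity check is repeated after the reference is taken.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object living in a type-stable, recyclable slab. */
struct obj {
	atomic_int refcnt; /* 0 means the object is being freed */
	atomic_int key;    /* identity used by the lookup */
};

/* Take a reference only if the object is still alive. */
static bool get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; a real implementation would free at zero. */
static void put_ref(struct obj *o)
{
	atomic_fetch_sub(&o->refcnt, 1);
}

/* Validate a candidate found by a lockless lookup. */
static struct obj *lookup_candidate(struct obj *candidate, int key)
{
	assert(candidate != NULL);
	if (atomic_load(&candidate->key) != key)
		return NULL;
	if (!get_unless_zero(candidate))
		return NULL;
	/* The slot may have been reused for another key in between. */
	if (atomic_load(&candidate->key) != key) {
		put_ref(candidate);
		return NULL;
	}
	return candidate;
}
```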
      
      req_prot_init() being only used by TCP and DCCP, I did not add
      a new slab_flags into their rsk_prot, but reuse prot->slab_flags
      
      Since all reqsk_alloc() users are correctly dealing with a failure,
      add the __GFP_NOWARN flag to avoid traces under pressure.
      
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 4236e2a1
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2015-09-30
      
      This series contains updates to i40e and i40evf only.
      
      Vasily Averin provides a couple of rtnl lock/unlock fixes for both i40e
      and i40evf.
      
      Shannon provides several updates and fixes, first fixes up a type clash
      in i40e_aq_rc_to_posix(), where the error codes are signed values, so we
      need to treat them as such.  Then fixes up a padding issue where an
      extra byte is added in i40e_aqc_get_cee_dcb_cfg_v1_resp to directly
      acknowledge the padding.  Updated i40e to keep debugfs register read
      and writes from accessing outside of the io-remapped space.  Added
      support and device id for another 20 GbE device.
      
      Jesse fixes the transmit hang workaround code for ARM that was causing
      Tx hangs to still occur occasionally when there really was no hang.  Then
      fixed the receive dropped counter to show up in netstat interface.
      Refactor the interrupt enable function since it was always making the
      caller add the base_vector from the VSI struct which is already passed
      to the function.  Fix kbuild warnings found in 0day build infrastructure
      by adding a harmless cast to a dev_info(), also fix 32 bit build
      warnings found by sparse.
      
      Greg fixed a configuration error that results if a port VLAN is set
      for a VF before the VF driver is loaded, so that when the VF driver is
      loaded the port VLAN is ignored.
      
      Mitch fixes the use of QOS field consistently in
      i40e_ndo_set_vf_port_vlan().  Modified the init timing of the driver
      to increase stability on load/unload and SR-IOV enable/disable cycles.
      
      Anjali updates i40e to not collect VEB stats if they are disabled in the
      hardware for performance reasons.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ravb-r8a7795' · 28117b08
      David S. Miller authored
      Simon Horman says:
      
      ====================
      ravb: Add support for r8a7795 SoC
      
      please consider this series for net-next.
      It enhances the ravb driver to support the r8a7795 SoC.
      
      Changes:
      
      * Dropped RFC prefix
      * Details in changelog of individual patches
      
      Base:
      
      * net-next/master
      
      Availability:
      
      To aid review of this in conjunction with other EtherAVB changes
      the following branches are available in my renesas tree on kernel.org.
      
      * me/r8a7795-ravb-driver-v4: this series
      * me/r8a7795-ravb-pfc-v2: r8a7795 sh-pfc update for EthernetAVB
      * me/r8a7795-ravb-integration-v4: enable EthernetAVB on r8a7795
      * me/r8a7795-ravb-driver-and-integration-v4.runtime:
            the above three branches with their runtime dependencies
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ravb: Add support for r8a7795 SoC · 22d4df8f
      Kazuya Mizuguchi authored
      This patch supports the r8a7795 SoC by:
      - Using two interrupts
        + One for E-MAC
        + One for everything else
        + Both can be handled by the existing common interrupt handler, which
          affords a simpler update to support the new SoC. In future some
          consideration may be given to implementing multiple interrupt handlers
      - Limiting the phy speed to 100Mbit/s for the new SoC;
        at this time it is not clear how this restriction may be lifted
        but I hope it will be possible as more information comes to light
      Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
      [horms: reworked]
      Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ravb: Document binding for r8a7795 SoC · 619f3bd2
      Kazuya Mizuguchi authored
      This patch updates the ravb binding to support the r8a7795 SoC by:
      - Adding a compat string for the new hardware
      - Adding 25 named interrupts to binding for the new SoC;
        older SoCs continue to use a single multiplexed interrupt
      
      The example is also updated to reflect the r8a7795 as this is the
      more complex case.
      
      Based on work by Kazuya Mizuguchi and others.
      Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
      Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>