1. 05 Oct, 2015 35 commits
    • David S. Miller's avatar
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/net-next · 40e10680
      David S. Miller authored
      Eric W. Biederman says:
      
      ====================
      net: Pass net through ip fragmentation
      
      This is the next installment of my work to pass struct net through the
      output path so the code does not need to guess how to figure out which
      network namespace it is in, and ultimately routes can have output
      devices in another network namespace.
      
      This round focuses on passing net through ip fragmentation which we seem
      to call from about everywhere.  That is the main ip output paths, the
      bridge netfilter code, and openvswitch.  This has to happen at once
      across the tree as function pointers are involved.
      
      First some prep work is done, then ipv4 and ipv6 are converted and then
      temporary helper functions are removed.
      ====================
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40e10680
    • David S. Miller's avatar
      Merge branch 'rds-perf' · 7e2832f1
      David S. Miller authored
      Sowmini Varadhan says:
      
      ====================
      RDS: RDS-TCP perf enhancements
      
      A 3-part patchset that (a) improves current RDS-TCP perf
      by 2X-3X and (b) refactors earlier robustness code for
      better observability/scaling.
      
      Patch 1 is an enhancement of earlier robustness fixes
      that had used separate sockets for client and server endpoints to
      resolve race conditions. It is possible to have an equivalent
      solution that does not use 2 sockets. The benefit of a
      single socket solution is that it results in more predictable
      and observable behavior for the underlying TCP pipe of an
      RDS connection.
      
      Patches 2 and 3 are simple, straightforward perf bug fixes
      that align the RDS TCP socket with other parts of the kernel stack.
      
      v2: fix kbuild-test-robot warnings, comments from Sergei Shtylov
          and Santosh Shilimkar.
      ====================
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e2832f1
    • Sowmini Varadhan's avatar
      RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit · 76b29ef1
      Sowmini Varadhan authored
      For the same reasons as commit 2f533844 ("tcp: allow splice() to
      build full TSO packets") and commit 35f9c09f ("tcp: tcp_sendpages()
      should call tcp_push() once"), rds_tcp_xmit may have multiple pages to
      send, so use MSG_MORE and MSG_SENDPAGE_NOTLAST as hints to
      tcp_sendpage().
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76b29ef1
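A minimal userspace sketch of the hint pattern described in this commit, not the rds_tcp_xmit code itself: when several chunks go out back to back, every chunk except the last carries MSG_MORE so the stack can coalesce them. The helper name `flags_for_chunk` is hypothetical, and the kernel-internal MSG_SENDPAGE_NOTLAST flag is left out because it is not available to userspace.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>   /* MSG_MORE (Linux) */

/*
 * Hypothetical helper: return the send flags for chunk `index` out of
 * `count` chunks. All chunks but the last hint that more data follows.
 */
static int flags_for_chunk(size_t index, size_t count)
{
    assert(index < count);
    return (index + 1 < count) ? MSG_MORE : 0;
}
```

A caller would pass the returned value as the flags argument of each send() in its loop, so only the final chunk triggers an immediate push.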
    • Sowmini Varadhan's avatar
      RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune · 1edd6a14
      Sowmini Varadhan authored
      Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K)
      clobbers efficient use of TSO because it inflates the size_goal
      that is computed in tcp_sendmsg/tcp_sendpage and skews packet
      latency, and the default values for these parameters actually
      result in significantly better performance.
      
      With this patch, request-response tests using rds-stress with a
      packet size of 100K and 16 threads (test parameters -q 100000 -a 256
      -t16 -d16) between a single pair of IP addresses achieve a throughput
      of 6-8 Gbps. Without this patch, throughput maxes out at 2-3 Gbps
      under equivalent conditions on these platforms.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1edd6a14
    • Sowmini Varadhan's avatar
      RDS: Use a single TCP socket for both send and receive. · 3b20fc38
      Sowmini Varadhan authored
      Commit f711a6ae ("net/rds: RDS-TCP: Always create a new rds_sock
      for an incoming connection.") modified rds-tcp so that an incoming SYN
      would ignore an existing "client" TCP connection which had the local
      port set to the transient port.  The motivation for ignoring the existing
      "client" connection in f711a6ae was to avoid race conditions and an
      endless duel of reconnect attempts triggered by a restart/abort of one
      of the nodes in the TCP connection.
      
      However, having separate sockets for active and passive sides
      is avoidable, and the simpler model of a single TCP socket for
      both sends and receives of all RDS connections associated with
      that tcp socket makes for easier observability. We avoid the race
      conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
      if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
      The c_outgoing bit is initialized in __rds_conn_create().
      
      A side-effect of re-using the client rds_connection for an incoming
      SYN is the potential of encountering duelling SYNs, i.e., we
      have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
      SYN. The logic to arbitrate this criss-crossing SYN exchange in
      rds_tcp_accept_one() has been modified to emulate the BGP state
      machine: the smaller IP address should back off from the connection attempt.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b20fc38
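The arbitration rule for criss-crossing SYNs can be sketched as a single comparison. This is an illustrative helper, not the code in rds_tcp_accept_one(); the name `local_should_back_off` is hypothetical, and the sketch assumes IPv4 addresses already converted to host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build a host-byte-order IPv4 address from its four octets. */
static uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
}

/*
 * Hypothetical tie-break for duelling SYNs: the side with the numerically
 * smaller address abandons its own outgoing attempt and accepts the
 * incoming one. Assumption: the two endpoints have distinct addresses.
 */
static bool local_should_back_off(uint32_t local_addr, uint32_t peer_addr)
{
    assert(local_addr != peer_addr);
    return local_addr < peer_addr;
}
```

Because both ends evaluate the same comparison with the arguments swapped, exactly one of them backs off, which is what breaks the tie.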
    • David S. Miller's avatar
      Merge branch 'xgbe-next' · 393159e9
      David S. Miller authored
      Tom Lendacky says:
      
      ====================
      amd-xgbe: AMD XGBE driver updates 2015-09-30
      
      The following patches are included in this driver update series:
      
      - Remove unneeded semi-colon
      - Follow the DT/ACPI precedence used by the device_ APIs
      - Add ethtool support for getting and setting the msglevel
      - Add ethtool error and debug messages
      - Simplify the hardware FIFO assignment calculations
      - Add receive buffer unavailable statistic
      - Use the device workqueue instead of the system workqueue
      - Remove the use of a link state bit
      
      This patch series is based on net-next.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      393159e9
    • Lendacky, Thomas's avatar
      amd-xgbe: Remove the XGBE_LINK state bit · 50789845
      Lendacky, Thomas authored
      The XGBE_LINK bit is used just to determine whether to call the
      netif_carrier_on/off functions. Rather than define and use this bit,
      just call the functions. The netif_carrier_ok function can be used in
      place of checking the XGBE_LINK bit in the future.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      50789845
    • Lendacky, Thomas's avatar
      amd-xgbe: Use device workqueue instead of system workqueue · afb43e8a
      Lendacky, Thomas authored
      The driver creates, flushes and destroys a device workqueue but queues
      work to the system workqueue. Switch from using the system workqueue to
      the device workqueue.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      afb43e8a
    • Lendacky, Thomas's avatar
      amd-xgbe: Add receive buffer unavailable statistic · 72c9ac4e
      Lendacky, Thomas authored
      Add a statistic that tracks how many times an interrupt is generated for
      a receive buffer not being available to the hardware which prevents the
      hardware from being able to DMA the received data.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      72c9ac4e
    • Lendacky, Thomas's avatar
      amd-xgbe: Simplify calculation and setting of queue fifos · 9c439e4b
      Lendacky, Thomas authored
      The Tx and Rx fifo sizes can be calculated rather
      than hardcoded in a switch statement. Additionally, the per-queue fifo
      sizes can be calculated rather than hardcoded using if/else if statements
      that can possibly underutilize the available fifo area.
      
      Change the code to calculate the fifo sizes and the per-queue fifo sizes
      to simplify the code and make best use of the available fifo.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9c439e4b
    • Lendacky, Thomas's avatar
      amd-xgbe: Add ethtool error and debug messages · e5dd8b81
      Lendacky, Thomas authored
      Add error and dynamic debug messages to various ethtool functions in
      the driver while also removing the DBGPR debug print calls. Also, change
      the message level for some error messages from alert to err.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e5dd8b81
    • Lendacky, Thomas's avatar
      amd-xgbe: Add ethtool support for setting the msglevel · 349fb2d7
      Lendacky, Thomas authored
      Provide the ethtool functions to support getting and setting the
      msglevel for the driver.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      349fb2d7
    • Lendacky, Thomas's avatar
      amd-xgbe: Use proper DT / ACPI precedence checking · 47f2e6c2
      Lendacky, Thomas authored
      Device tree presence takes precedence over ACPI in the device_* APIs.
      The amd-xgbe driver should follow the same precedence. Update the check
      on whether to use DT / ACPI to follow this.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      47f2e6c2
    • Lendacky, Thomas's avatar
      amd-xgbe: Remove an unneeded semicolon on a switch statement · 3947d78a
      Lendacky, Thomas authored
      Remove an unneeded semicolon at the end of a switch statement block.
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3947d78a
    • Eric Dumazet's avatar
      tcp: restore fastopen operations · ac8cfc7b
      Eric Dumazet authored
      I accidentally cleared fastopenq.max_qlen in reqsk_queue_alloc()
      while max_qlen can be set before listen() is called,
      using TCP_FASTOPEN socket option for example.
      
      Fixes: 0536fcc0 ("tcp: prepare fastopen code for upcoming listener changes")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ac8cfc7b
    • David S. Miller's avatar
      Merge branch 'net-y2038' · 77946de5
      David S. Miller authored
      Arnd Bergmann says:
      
      ====================
      net: assorted y2038 changes
      
      This is a set of changes for network drivers and core code to
      get rid of the use of time_t and derived data structures.
      
      I have a longer set of patches that enables me to build kernels
      with the time_t definition removed completely as a help to find
      y2038 overflow issues. This is the subset for networking that
      contains all code that has a reasonable way of fixing at the
      moment and that is either commonly used (in one of the defconfigs)
      or that blocks building a whole subsystem.
      
      Most of the patches in this series should be noncontroversial,
      but the last two that I marked [RFC] are a bit tricky and
      need input from people that are more familiar with the code than
      I am. All 12 patches are independent of one another and can
      be applied in any order, so feel free to pick all that look
      good.
      
      Patches that are not included here are:
      
       - disabling less common device drivers that I don't have a fix
         for yet, this includes
      	drivers/net/ethernet/brocade/bna/bfa_ioc.c
      	drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c
      	drivers/net/ethernet/tile/tilegx.c
      	drivers/net/hamradio/baycom_ser_fdx.c
      	drivers/net/wireless/ath/ath10k/core.h
      	drivers/net/wireless/ath/ath9k/
      	drivers/net/wireless/ath/ath9k/
      	drivers/net/wireless/atmel.c
      	drivers/net/wireless/prism54/isl_38xx.c
      	drivers/net/wireless/rt2x00/rt2x00debug.c
      	drivers/net/wireless/rtlwifi/
      	drivers/net/wireless/ti/wlcore/
      	drivers/staging/ozwpan/
      	net/atm/mpoa_caches.c
      	net/atm/mpoa_proc.c
      	net/dccp/probe.c
      	net/ipv4/tcp_probe.c
      	net/netfilter/nfnetlink_queue_core.c
      	net/netfilter/nfnetlink_queue_core.c
      	net/netfilter/xt_time.c
      	net/openvswitch/flow.c
      	net/sctp/probe.c
      	net/sunrpc/auth_gss/
      	net/sunrpc/svcauth_unix.c
      	net/vmw_vsock/af_vsock.c
         We'll get there eventually, or we can add a dependency to ensure
         they are not built on 32-bit kernels that need to survive
         beyond 2038. Most of these should be really easy to fix.
      
       - recvmmsg/sendmmsg system calls: patches have been sent out
         as part of the syscall series, need a little more work and
         review
      
       - SIOCGSTAMP/SIOCGSTAMPNS ioctl calls: tricky, need to discuss
         with some folks at kernel summit
      
       - SO_RCVTIMEO/SO_SNDTIMEO/SO_TIMESTAMP/SO_TIMESTAMPNS socket
         opt: similar and related to the ioctl
      
       - mmapped packet socket: need to create v4 of the API, nontrivial
      
       - pktgen: sends 32-bit timestamps over network, need to find out
         if using unsigned stamps is good enough
      
       - af_rxrpc: similar to pktgen, uses 32-bit times for deadlines
      
       - ppp ioctl: patch is being worked on, nontrivial but doable
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      77946de5
    • Arnd Bergmann's avatar
      net: sctp: avoid incorrect time_t use · 3ef0a25b
      Arnd Bergmann authored
      We want to avoid using time_t in the kernel because of the y2038
      overflow problem. The use in sctp is not for storing seconds at
      all, but instead uses microseconds and is passed as 32-bit
      on all machines.
      
      This patch changes the type to u32, which better fits the use.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: linux-sctp@vger.kernel.org
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3ef0a25b
    • Arnd Bergmann's avatar
      ipv6: use ktime_t for internal timestamps · 3dd7669f
      Arnd Bergmann authored
      The ipv6 mip6 implementation is one of only a few users of the
      skb_get_timestamp() function in the kernel, which is both unsafe
      on 32-bit architectures because of the 2038 overflow, and slightly
      less efficient than the skb_get_ktime() based approach.
      
      This converts the function call and the mip6_report_rate_limiter
      structure that stores the time stamp, eliminating all uses of
      timeval in the ipv6 code.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3dd7669f
    • Arnd Bergmann's avatar
      nfnetlink: use y2038 safe timestamp · f6389ecb
      Arnd Bergmann authored
      The __build_packet_message function fills a nfulnl_msg_packet_timestamp
      structure that uses 64-bit seconds and is therefore y2038 safe, but
      it uses an intermediate 'struct timespec' which is not.
      
      This trivially changes the code to use 'struct timespec64' instead,
      to correct the result on 32-bit architectures.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Cc: netfilter-devel@vger.kernel.org
      Cc: coreteam@netfilter.org
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f6389ecb
    • Arnd Bergmann's avatar
      atm: remove 'struct zatm_t_hist' · 70ba07b6
      Arnd Bergmann authored
      The zatm_t_hist structure is not used anywhere in the kernel, but is
      exported to user space. As we are trying to eliminate uses of time_t
      in the kernel for y2038 compatibility, the current definition triggers
      checking tools because it contains 'struct timeval'.
      
      As pointed out by Chas Williams, the only user of this structure was
      the ZATM_GETHIST ioctl command that has been removed a long time ago,
      and we can remove the structure as well without breaking any user
      space.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Chas Williams <3chas3@gmail.com>
      Cc: linux-atm-general@lists.sourceforge.net
      Signed-off-by: David S. Miller <davem@davemloft.net>
      70ba07b6
    • Arnd Bergmann's avatar
      mac80211: use ktime_get_seconds · 84b00607
      Arnd Bergmann authored
      The mac80211 code uses ktime_get_ts to measure the connected time.
      As this uses monotonic time, it is y2038 safe on 32-bit systems,
      but we still want to deprecate the use of 'timespec' because most
      other users are broken.
      
      This changes the code to use ktime_get_seconds() instead, which
      avoids the timespec structure and is slightly more efficient.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: linux-wireless@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      84b00607
    • Arnd Bergmann's avatar
      mwifiex: avoid gettimeofday in ba_threshold setting · 52f4f918
      Arnd Bergmann authored
      mwifiex_get_random_ba_threshold() uses a complex homegrown implementation
      to generate a pseudo-random number from the current time as returned
      from do_gettimeofday().
      
      This currently requires two 32-bit divisions plus a couple of other
      computations that are eventually discarded as only eight bits of
      the microsecond portion are used at all.
      
      We could replace this with a call to get_random_bytes(), but that
      might drain the entropy pool too fast if this is called for each
      packet.
      
      Instead, this patch converts it to use ktime_get_ns(), which is a
      bit faster than do_gettimeofday(), and then uses a similar algorithm
      as before, but in a way that takes both the nanosecond and second
      portion into account for a slightly-more-but-still-not-very-random
      pseudorandom number.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Amitkumar Karwar <akarwar@marvell.com>
      Cc: Nishant Sarmukadam <nishants@marvell.com>
      Cc: Kalle Valo <kvalo@codeaurora.org>
      Cc: linux-wireless@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52f4f918
    • Arnd Bergmann's avatar
      mwifiex: use ktime_get_real for timestamping · e253fb74
      Arnd Bergmann authored
      The mwifiex_11n_aggregate_pkt() function creates a ktime_t from
      a timeval returned by do_gettimeofday, which is slow and causes
      an overflow in 2038 on 32-bit architectures.
      
      This solves both problems by using the appropriate ktime_get_real()
      function.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Amitkumar Karwar <akarwar@marvell.com>
      Cc: Nishant Sarmukadam <nishants@marvell.com>
      Cc: Kalle Valo <kvalo@codeaurora.org>
      Cc: linux-wireless@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e253fb74
    • Arnd Bergmann's avatar
      net: igb: avoid using timespec · 40c9b079
      Arnd Bergmann authored
      We want to deprecate the use of 'struct timespec' on 32-bit
      architectures, as it will overflow in 2038. The igb
      driver uses it to read the current time, and can simply
      be changed to use ktime_get_real_ts64() instead.
      
      Because of hardware limitations, there is still an overflow
      in year 2106, which we cannot really avoid, but this documents
      the overflow.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: intel-wired-lan@lists.osuosl.org
      Reviewed-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40c9b079
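The two overflow years mentioned in these driver commits follow from simple arithmetic on 32-bit second counters. The sketch below is illustrative only (the struct and function names are hypothetical): it splits a seconds-since-1970 value into whole days and seconds within the day, which is enough to check where a signed and an unsigned 32-bit counter run out.

```c
#include <assert.h>
#include <stdint.h>

#define SECS_PER_DAY 86400u

/* Hypothetical result type: whole days since 1970-01-01 plus the remainder. */
struct epoch_split {
    uint32_t days;
    uint32_t secs_of_day;
};

/* Split a count of seconds since the Unix epoch into days and seconds. */
static struct epoch_split split_epoch_seconds(uint64_t secs)
{
    struct epoch_split s;

    assert(secs / SECS_PER_DAY <= UINT32_MAX);
    s.days = (uint32_t)(secs / SECS_PER_DAY);
    s.secs_of_day = (uint32_t)(secs % SECS_PER_DAY);
    return s;
}
```

For INT32_MAX the split is 24855 days and 11647 seconds, that is 2038-01-19 03:14:07 UTC; for UINT32_MAX it is 49710 days and 23295 seconds, that is 2106-02-07 06:28:15 UTC. Treating the same 32 bits as unsigned therefore only postpones the overflow from 2038 to 2106.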
    • Arnd Bergmann's avatar
      net: stmmac: avoid using timespec · 0a624155
      Arnd Bergmann authored
      We want to deprecate the use of 'struct timespec' on 32-bit
      architectures, as it will overflow in 2038. The stmmac
      driver uses it to read the current time, and can simply
      be changed to use ktime_get_real_ts64() instead.
      
      Because of hardware limitations, there is still an overflow
      in year 2106, which we cannot really avoid, but this documents
      the overflow.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a624155
    • Arnd Bergmann's avatar
      net: fec: avoid timespec use · be7ccdc3
      Arnd Bergmann authored
      The fec_ptp_enable_pps uses an open-coded implementation of ns_to_timespec,
      which will be removed eventually as it is not y2038-safe on 32-bit
      architectures. Two more instances of the same code in this file were
      already converted to use the safe ns_to_timespec64 in commit 6630514f
      ("ptp: fec: use helpers for converting ns to timespec"), this changes
      the last one as well.
      
      The seconds portion here is actually unused and we could just remove the
      timespec variable, but using ns_to_timespec64 can still be better as the
      implementation can be hand-optimized in the future.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Fugang Duan <b38611@freescale.com>
      Cc: Luwei Zhou <b45643@freescale.com>
      Cc: Frank Li <Frank.Li@freescale.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      be7ccdc3
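For readers unfamiliar with the conversion being replaced, here is a userspace sketch of what a nanoseconds-to-timespec helper does when the seconds field is 64 bits wide. It is not the kernel's ns_to_timespec64 source; the type `ts64`, the constant `NS_PER_SEC` and the function `ns_to_ts64` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_SEC 1000000000LL

/* Hypothetical stand-in for a timespec with a 64-bit seconds field. */
struct ts64 {
    int64_t tv_sec;
    long    tv_nsec;
};

/* Convert signed nanoseconds to seconds plus a non-negative remainder. */
static struct ts64 ns_to_ts64(int64_t ns)
{
    struct ts64 ts;
    int64_t rem = ns % NS_PER_SEC;   /* C division truncates toward zero */

    ts.tv_sec = ns / NS_PER_SEC;
    if (rem < 0) {                   /* normalize negative inputs */
        ts.tv_sec--;
        rem += NS_PER_SEC;
    }
    ts.tv_nsec = (long)rem;
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < NS_PER_SEC);
    return ts;
}
```

With a 64-bit seconds field the result stays correct for inputs whose seconds part exceeds the signed 32-bit range, which is the property a 32-bit `struct timespec` cannot offer.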
    • David S. Miller's avatar
      Merge branch 'ipv4-multipath-hash' · 07355737
      David S. Miller authored
      Peter Nørlund says:
      
      ====================
      ipv4: Hash-based multipath routing
      
      When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed
      from more or less being destination-based into being quasi-random per-packet
      scheduling. This increases the risk of out-of-order packets and makes it
      impossible to use multipath together with anycast services.
      
      This patch series replaces the old implementation with flow-based load
      balancing based on a hash over the source and destination addresses.
      
      Distribution of the hash is done with thresholds as described in RFC 2992.
      This reduces the disruption when a path is added/removed when having more than
      two paths.
      
      To further the chance of successful usage in conjunction with anycast, ICMP
      error packets are hashed over the inner IP addresses. This ensures that PMTU
      will work together with anycast or load-balancers such as IPVS.
      
      Port numbers are not considered since fragments could cause problems with
      anycast and IPVS. Relying on the DF-flag for TCP packets is also insufficient,
      since ICMP inspection effectively extracts information from the opposite
      flow which might have a different state of the DF-flag. This is also why the
      RSS hash is not used. These are typically based on the NDIS RSS spec which
      mandates TCP support.
      
      Measurements of the additional overhead of a two-path multipath
      (ip_mkroute_input excl. __mkroute_input) on a Xeon X3550 (4 cores, 2.66GHz):
      
      Original per-packet: ~394 cycles/packet
      L3 hash:              ~76 cycles/packet
      
      Changes in v5:
      - Fixed compilation error
      
      Changes in v4:
      - Functions take hash directly instead of func ptr
      - Added inline hash function
      - Added dummy macros to minimize ifdefs
      - Use upper 31 bits of hash instead of lower
      
      Changes in v3:
      - Multipath algorithm is no longer configurable (always L3)
      - Added random seed to hash
      - Moved ICMP inspection to isolated function
      - Ignore source quench packets (deprecated as per RFC 6633)
      
      Changes in v2:
      - Replaced 8-bit xor hash with 31-bit jenkins hash
      - Don't scale weights (since 31-bit)
      - Avoided unnecessary renaming of variables
      - Rely on DF-bit instead of fragment offset when checking for fragmentation
      - upper_bound is now inclusive to avoid overflow
      - Use a callback to postpone extracting flow information until necessary
      - Skipped ICMP inspection entirely with L4 hashing
      - Handle newly added sysctl ignore_routes_with_linkdown
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      07355737
    • Peter Nørlund's avatar
      ipv4: ICMP packet inspection for multipath · 79a13159
      Peter Nørlund authored
      ICMP packets are inspected to let them route together with the flow they
      belong to, minimizing the chance that a problematic path will affect flows
      on other paths, and so that anycast environments can work with ECMP.
      Signed-off-by: Peter Nørlund <pch@ordbogen.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79a13159
    • Peter Nørlund's avatar
      ipv4: L3 hash-based multipath · 0e884c78
      Peter Nørlund authored
      Replaces the per-packet multipath with a hash-based multipath using
      source and destination address.
      Signed-off-by: Peter Nørlund <pch@ordbogen.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0e884c78
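The hash-threshold idea from RFC 2992 referenced in this series can be sketched as follows. This is a simplified illustration, not the kernel implementation: the `nexthop` struct and both function names are hypothetical, the caller is assumed to supply an already computed 32-bit flow hash, and only the "upper 31 bits" and "inclusive upper bound" details are taken from the changelog above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical next-hop record; field names are illustrative only. */
struct nexthop {
    uint32_t weight;       /* relative share of flows */
    uint32_t upper_bound;  /* inclusive upper edge of this hop's hash region */
};

/* Carve the 31-bit hash space [0, 2^31 - 1] into weight-proportional regions. */
static void assign_bounds(struct nexthop *nh, size_t n)
{
    uint64_t total = 0, seen = 0;

    for (size_t i = 0; i < n; i++)
        total += nh[i].weight;
    assert(total > 0);

    for (size_t i = 0; i < n; i++) {
        if (nh[i].weight == 0) {     /* unused hop: skipped by select_path() */
            nh[i].upper_bound = 0;
            continue;
        }
        seen += nh[i].weight;
        nh[i].upper_bound = (uint32_t)(((seen << 31) / total) - 1);
    }
}

/* Pick the first hop whose region contains the upper 31 bits of the hash. */
static int select_path(const struct nexthop *nh, size_t n, uint32_t hash)
{
    uint32_t h = hash >> 1;

    for (size_t i = 0; i < n; i++)
        if (nh[i].weight != 0 && h <= nh[i].upper_bound)
            return (int)i;
    return -1;  /* not reached when bounds come from assign_bounds() */
}
```

Since a flow's hash never changes, all of its packets fall in the same region and take the same path, and adding or removing a hop only moves the flows that sit near the shifted region boundaries.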
    • David S. Miller's avatar
      Merge branch 'tcp-listener-fixes-and-improvement' · 2472186f
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: lockless listener fixes and improvement
      
      This fixes issues with TCP FastOpen vs lockless listeners,
      and SYNACK being attached to request sockets.
      
      Then, last patch brings performance improvement for
      syncookies generation and validation.
      
      Tested under a 4.3 Mpps SYNFLOOD attack, new perf profile looks
      like :
          12.11%  [kernel]  [k] sha_transform
           5.83%  [kernel]  [k] tcp_conn_request
           4.59%  [kernel]  [k] __inet_lookup_listener
           4.11%  [kernel]  [k] ipt_do_table
           3.91%  [kernel]  [k] tcp_make_synack
           3.05%  [kernel]  [k] fib_table_lookup
           2.74%  [kernel]  [k] sock_wfree
           2.66%  [kernel]  [k] memcpy_erms
           2.12%  [kernel]  [k] tcp_v4_rcv
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2472186f
    • Eric Dumazet's avatar
      tcp: avoid two atomic ops for syncookies · a1a5344d
      Eric Dumazet authored
      inet_reqsk_alloc() is used to allocate a temporary request
      in order to generate a SYNACK with a cookie. Then later,
      syncookie validation also uses a temporary request.
      
      These paths already took a reference on listener refcount,
      so we can avoid a couple of atomic operations.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a1a5344d
    • Eric Dumazet's avatar
      net: use sk_fullsock() in __netdev_pick_tx() · 004a5d01
      Eric Dumazet authored
      SYN_RECV & TIMEWAIT sockets are not full blown, they do not have a
      sk_dst_cache pointer.
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      004a5d01
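The pattern behind this fix and the two sk_fullsock() fixes that follow it can be illustrated in userspace: a small common header is shared by "mini" and full sockets, and fields that exist only in the full object must not be touched until the state has been checked. Everything below (the enum values, struct layouts and function names) is a hypothetical stand-in, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states; the numeric values do not mirror the kernel's. */
enum sock_state { ST_ESTABLISHED = 1, ST_TIME_WAIT, ST_SYN_RECV_MINI, ST_LISTEN };

/* Fields shared by every socket flavour. */
struct mini_sock {
    enum sock_state state;
};

/* A full socket embeds the common part first and adds more fields. */
struct full_sock {
    struct mini_sock common;
    const char *dst_cache;   /* exists only in full sockets */
};

/* True unless the state marks one of the reduced socket flavours. */
static bool is_full_sock(const struct mini_sock *sk)
{
    assert(sk != NULL);
    return ((1u << sk->state) &
            ~((1u << ST_TIME_WAIT) | (1u << ST_SYN_RECV_MINI))) != 0;
}

/* Read the full-socket-only field, or NULL when it does not exist. */
static const char *dst_cache_or_null(const struct mini_sock *sk)
{
    if (!is_full_sock(sk))
        return NULL;
    return ((const struct full_sock *)sk)->dst_cache;
}
```

Without the state check, the cast in `dst_cache_or_null()` would read past the end of a mini socket, which is the class of bug these commits guard against.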
    • Eric Dumazet's avatar
      ipv6: inet6_sk() should use sk_fullsock() · e7eadb4d
      Eric Dumazet authored
      SYN_RECV & TIMEWAIT sockets are not full blown, they do not have a pinet6
      pointer.
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e7eadb4d
    • Eric Dumazet's avatar
      inet: ip_skb_dst_mtu() should use sk_fullsock() · caf3f267
      Eric Dumazet authored
      SYN_RECV & TIMEWAIT sockets are not full blown,
      do not even try to call ip_sk_use_pmtu() on them.
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      caf3f267
    • Eric Dumazet's avatar
      tcp: fix fastopen races vs lockless listener · 7656d842
      Eric Dumazet authored
      There are multiple races that need fixes :
      
      1) skb_get() + queue skb + kfree_skb() is racy
      
      An accept() can be done on another cpu, data consumed immediately.
      tcp_recvmsg() uses __kfree_skb() as it is assumed all skb found in
      socket receive queue are private.
      
      Then the kfree_skb() in tcp_rcv_state_process() uses an already freed skb
      
      2) tcp_reqsk_record_syn() needs to be done before tcp_try_fastopen()
      for the same reasons.
      
      3) We want to send the SYNACK before queueing child into accept queue,
      otherwise we might reintroduce the ooo issue fixed in
      commit 7c85af88 ("tcp: avoid reorders for TFO passive connections")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7656d842
  2. 04 Oct, 2015 5 commits