1. 11 Nov, 2014 8 commits
    • Merge branch 'so_incoming_cpu' · b00394c0
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net: SO_INCOMING_CPU support
      
      The SO_INCOMING_CPU socket option (read with getsockopt()) provides
      an alternative to RPS/RFS for high-performance servers using
      multiqueue NICs.
      
      TCP should use sk_mark_napi_id() for established sockets only.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: introduce SO_INCOMING_CPU · 2c8c56e1
      Eric Dumazet authored
      An alternative to RPS/RFS is to use hardware support for multiple
      queues.
      
      Then split a set of millions of sockets across worker threads, each
      one using epoll() to manage events on its own socket pool.
      
      Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
      know after accept() or connect() on which queue/cpu a socket is managed.
      
      We normally use one cpu per RX queue (IRQ smp_affinity being properly
      set), so remembering in the socket structure which cpu delivered the
      last packet is enough to solve the problem.
      
      After accept(), connect(), or even after passing a file descriptor
      between processes, applications can use:
      
       int cpu;
       socklen_t len = sizeof(cpu);
      
       getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
      
      They can then use this information to put the socket into the right
      silo for optimal performance, as the whole networking stack should run
      on the appropriate cpu, with no need to send IPIs (as RPS/RFS do).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
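      The getsockopt() call from the commit message can be wrapped into a
      small helper that maps a socket to a worker. This is only a sketch:
      incoming_cpu() and pick_silo() are hypothetical names, the
      "cpu modulo number of workers" policy is an assumption, and the
      fallback value 49 is the asm-generic definition (a few architectures
      use a different number).

```c
#include <assert.h>
#include <sys/socket.h>

/* Fallback for older libc headers; 49 is the asm-generic value. */
#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49
#endif

/* Return the cpu that last delivered a packet for fd, or -1 on error
 * (for example EBADF, or ENOPROTOOPT on kernels without the option). */
static int incoming_cpu(int fd)
{
	int cpu = -1;
	socklen_t len = sizeof(cpu);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0)
		return -1;
	return cpu;
}

/* Hypothetical policy: one worker ("silo") per cpu, wrapping around when
 * there are fewer workers than cpus; an unknown cpu falls back to worker 0. */
static int pick_silo(int cpu, int nworkers)
{
	assert(nworkers > 0);
	if (cpu < 0)
		return 0;
	return cpu % nworkers;
}
```

      A server would typically call pick_silo(incoming_cpu(fd), nworkers)
      right after accept() and hand the descriptor to that worker's epoll
      set.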
    • tcp: move sk_mark_napi_id() at the right place · 3d97379a
      Eric Dumazet authored
      sk_mark_napi_id() records, for a flow, the NAPI id of incoming
      packets for the sake of busy polling.
      We should do this only on established flows, not on listeners.
      
      This was 'working' by virtue of the socket cloning, but doing
      this on SYN packets is unnecessary cache line dirtying.
      
      Even if we move sk_napi_id into the same cache line as sk_lock,
      we are working to make SYN processing lockless, so it is desirable
      to set sk_napi_id only for established flows.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlx4: restore conditional call to napi_complete_done() · 2e1af7d7
      Eric Dumazet authored
      After commit 1a288172 ("mlx4: use napi_complete_done()") we ended up
      calling napi_complete_done() even in the case where the NAPI poll
      consumed all its budget.
      
      This added extra interrupt pressure; this patch restores the proper
      behavior.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: 1a288172 ("mlx4: use napi_complete_done()")
      Signed-off-by: David S. Miller <davem@davemloft.net>
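      The rule being restored (complete polling only when the budget was
      not exhausted) can be illustrated with a userspace model. This is
      not the mlx4 code: struct model_napi, model_complete_done() and
      model_poll() are hypothetical stand-ins for the kernel's NAPI
      context, napi_complete_done() and a driver poll routine.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for a NAPI context; not the kernel's napi_struct. */
struct model_napi {
	int pending;     /* packets waiting in the RX ring */
	bool scheduled;  /* true while the poller still owns the queue */
	int irq_rearms;  /* times polling completed and the IRQ was re-armed */
};

/* Stand-in for napi_complete_done(): leave polling mode, re-arm the IRQ. */
static void model_complete_done(struct model_napi *n, int work_done)
{
	(void)work_done;
	n->scheduled = false;
	n->irq_rearms++;
}

/* Process up to budget packets and report how many were handled. */
static int model_poll(struct model_napi *n, int budget)
{
	int done;

	assert(budget > 0);
	done = n->pending < budget ? n->pending : budget;
	n->pending -= done;

	/* Budget exhausted: more work may remain, so stay scheduled and
	 * let the core call the poll routine again without an interrupt. */
	if (done < budget)
		model_complete_done(n, done);
	return done;
}
```

      With 100 pending packets and a budget of 64, the first call handles
      64 packets and stays in polling mode; only the second call, which
      handles the remaining 36, re-arms the interrupt.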
    • Merge branch 'sunvnet-next' · d21385fa
      David S. Miller authored
      Sowmini Varadhan says:
      
      ====================
      sunvnet: edge-case/race-conditions bug fixes
      
      This patch series contains fixes for race conditions in sunvnet
      that can be encountered when there is a difference in latency between
      producer and consumer.
      
      Patch 1 addresses a case when the STOPPED LDC ack from a peer is
      processed before vnet_start_xmit can finish updating the dr->prod
      state.
      
      Patch 2 fixes the edge-case when outgoing data and incoming
      stopped-ack cross each other in flight.
      
      Patch 3 adds a missing rcu_read_unlock(), found by code-inspection.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sunvnet: Add missing rcu_read_unlock() in vnet_start_xmit · df20286a
      Sowmini Varadhan authored
      The out_dropped label will only do rcu_read_unlock() for a non-null port.
      So add the missing rcu_read_unlock() when bailing out due to a null port.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sunvnet: vnet_ack() should check if !start_cons to send a missed trigger · 777362d7
      Sowmini Varadhan authored
      As per the comments in vnet_start_xmit(), for the edge case
      when outgoing vnet_start_xmit() data and an incoming STOPPED
      ACK cross each other in flight, we may need to send the missed
      START trigger from maybe_tx_wakeup() after checking for a
      false value of start_cons.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sunvnet: Fix race between vnet_start_xmit() and vnet_ack() · b0cffed5
      Sowmini Varadhan authored
      When vnet_start_xmit() is concurrent with vnet_ack(), we may
      have a race that looks like:
      
          thread 1                              thread 2
          vnet_start_xmit                       vnet_event_napi -> vnet_rx

          __vnet_tx_trigger for some desc X
          at this point dr->prod == X
                                                peer sends back a stopped ack for X
                                                we process X, but X == dr->prod
                                                so we bail out in vnet_ack with
                                                !idx_is_pending
          update dr->prod
      
      As a result of the fact that we never processed the stopped ack for X,
      the Tx path is led to incorrectly believe that the peer is still
      "started" and reading, but the peer has stopped reading, which will
      ultimately end in flow-control assertions.
      
      The fix is to synchronize the above two paths on the netif_tx_lock.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
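      The window described above can be reproduced in a single-threaded
      model of the descriptor ring. This is a simplified sketch, not the
      driver code: struct model_dring, RING_SIZE and ring_next() are
      hypothetical, and the body of idx_is_pending() is an assumption
      about how such a check might walk the ring from cons to prod.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 8u

/* Hypothetical, reduced view of the TX descriptor ring state. */
struct model_dring {
	unsigned int cons;  /* oldest descriptor not yet acked */
	unsigned int prod;  /* next descriptor the sender will fill */
};

static unsigned int ring_next(unsigned int idx)
{
	return (idx + 1) % RING_SIZE;
}

/* True if descriptor end lies in [cons, prod), i.e. is awaiting an ack. */
static bool idx_is_pending(const struct model_dring *dr, unsigned int end)
{
	unsigned int idx;

	assert(end < RING_SIZE);
	for (idx = dr->cons; idx != dr->prod; idx = ring_next(idx))
		if (idx == end)
			return true;
	return false;
}
```

      If the ack for descriptor X is checked while prod still equals X,
      idx_is_pending() reports false and the ack is dropped; once prod has
      been advanced, the same check succeeds. Holding one lock across
      "trigger, then update prod" and across the ack handling removes the
      first case.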
  2. 10 Nov, 2014 15 commits
    • 8139too: Allow using the largest possible MTU · 6f6e741f
      Alban Bedel authored
      This driver allows an MTU of up to 1518 bytes, which is not enough to
      run batman-adv. Simply raise the maximum packet size to the maximum
      allowed by the transmit descriptor, 1792 bytes, giving a maximum MTU
      of 1774 bytes.
      Signed-off-by: Alban Bedel <albeu@free.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
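      The 1774-byte figure is consistent with subtracting 18 bytes of
      framing from the 1792-byte descriptor limit. The sketch below
      assumes those 18 bytes are a 14-byte Ethernet header plus a 4-byte
      FCS; the MODEL_* constants, the 68-byte lower bound and both
      function names are illustrative, not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_FRAME   1792  /* descriptor limit quoted in the commit */
#define MODEL_ETH_HLEN      14  /* dst MAC + src MAC + EtherType (assumed) */
#define MODEL_ETH_FCS_LEN    4  /* trailing CRC (assumed) */
#define MODEL_MIN_MTU       68  /* assumed lower bound (minimum IPv4 MTU) */

/* Largest payload that still fits in a frame of max_frame bytes. */
static int model_max_mtu(int max_frame)
{
	assert(max_frame > MODEL_ETH_HLEN + MODEL_ETH_FCS_LEN);
	return max_frame - MODEL_ETH_HLEN - MODEL_ETH_FCS_LEN;
}

/* Range check of the kind an MTU-change callback would perform. */
static bool model_mtu_ok(int mtu)
{
	return mtu >= MODEL_MIN_MTU && mtu <= model_max_mtu(MODEL_MAX_FRAME);
}
```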
    • 8139too: Allow setting MTU larger than 1500 · ef786f10
      Alban Bedel authored
      Replace the default ndo_change_mtu callback with one that allows
      setting any MTU that the driver can handle.
      Signed-off-by: Alban Bedel <albeu@free.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'master-2014-11-04' of... · b9217266
      David S. Miller authored
      Merge tag 'master-2014-11-04' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
      
      John W. Linville says:
      
      ====================
      pull request: wireless-next 2014-11-07
      
      Please pull this batch of updates intended for the 3.19 stream!
      
      For the mac80211 bits, Johannes says:
      
      "This relatively large batch of changes is comprised of the following:
       * large mac80211-hwsim changes from Ben, Jukka and a bit myself
       * OCB/WAVE/11p support from Rostislav on behalf of the Czech Technical
         University in Prague and Volkswagen Group Research
       * minstrel VHT work from Karl
       * more CSA work from Luca
       * WMM admission control support in mac80211 (myself)
       * various smaller fixes, spelling corrections, and minor API additions"
      
      For the Bluetooth bits, Johan says:
      
      "Here's the first bluetooth-next pull request for 3.19. The vast majority
      of patches are for ieee802154 from Alexander Aring with various fixes
      and cleanups. There are also several LE/SMP fixes as well as improved
      support for handling LE devices that have lost their pairing information
      (the patches from Alfonso). Jukka provides a couple of stability fixes
      for 6lowpan and Szymon conformance fixes for RFCOMM. For the HCI drivers
      we have one new USB ID for an Acer controller as well as a reset
      handling fix for H5."
      
      For the Atheros bits, Kalle says:
      
      "Major changes are:
      
      o ethtool support (Ben)
      
      o print dev string prefix with debug hex buffers dump (Michal)
      
      o debugfs file to read calibration data from the firmware for
        verification purposes (me)
      
      o fix fw_stats debugfs file, now results are more reliable (Michal)
      
      o firmware crash counters via debugfs (Ben&me)
      
      o various tracing points to debug firmware (Rajkumar)
      
      o make it possible to provide firmware calibration data via a file (me)
      
      And we have quite a lot of smaller fixes and clean up."
      
      For the iwlwifi bits, Emmanuel says:
      
      "The big new thing here is netdetect which allows the
      firmware to wake up the platform when a specific network
      is detected. Along with that I have fixes for d3 operation.
      The usual amount of rate scaling stuff - we now support STBC.
      The other commit that stands out is Johannes's work on
      devcoredump. He basically starts to use the standard
      infrastructure he built."
      
      Along with that are the usual sort of updates and such for ath9k,
      brcmfmac, wil6210, and a handful of other bits here and there...
      
      Please let me know if there are problems!
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'raw_probe_proto_opt' · e344458f
      David S. Miller authored
      Herbert Xu says:
      
      ====================
      ipv4: Simplify raw_probe_proto_opt and avoid reading user iov twice
      
      This series rewrites the function raw_probe_proto_opt in a more
      readable fashion, and then fixes the long-standing bug where we
      read the probed bytes twice, which means that what we're using to
      probe may in fact be invalid.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Avoid reading user iov twice after raw_probe_proto_opt · c008ba5b
      Herbert Xu authored
      Ever since raw_probe_proto_opt was added it had the problem of
      causing the user iov to be read twice, once during the probe for
      the protocol header and once again in ip_append_data.
      
      This is a potential security problem since it means that whatever
      we're probing may be invalid.  This patch plugs the hole by
      firstly advancing the iov so we don't read the same spot again,
      and secondly saving what we read the first time around for use
      by ip_append_data.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
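      The "read once, then reuse the saved copy" idea can be shown with a
      userspace model in which the caller rewrites its buffer between the
      probe and the send. This is only an illustration: struct probe_state,
      probe_once() and build_packet() are hypothetical and do not mirror
      the kernel's data structures.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the bytes captured during the probe. */
struct probe_state {
	unsigned char hdr[2];
	size_t hdrlen;
};

/* Read the leading bytes from the caller's buffer exactly once. */
static void probe_once(struct probe_state *ps, const unsigned char *user,
		       size_t len)
{
	assert(ps != NULL && user != NULL);
	ps->hdrlen = len < sizeof(ps->hdr) ? len : sizeof(ps->hdr);
	memcpy(ps->hdr, user, ps->hdrlen);
}

/* Build the outgoing packet: the probed bytes come from the saved copy,
 * the rest from the caller's buffer past the already-consumed prefix. */
static size_t build_packet(const struct probe_state *ps,
			   const unsigned char *user, size_t len,
			   unsigned char *out)
{
	assert(len >= ps->hdrlen);
	memcpy(out, ps->hdr, ps->hdrlen);
	memcpy(out + ps->hdrlen, user + ps->hdrlen, len - ps->hdrlen);
	return len;
}
```

      Because build_packet() never looks at the first two bytes of the
      caller's buffer again, a change made after the probe cannot make the
      packet disagree with what was probed.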
    • ipv4: Use standard iovec primitive in raw_probe_proto_opt · 32b5913a
      Herbert Xu authored
      The function raw_probe_proto_opt tries to extract the first two
      bytes from the user input in order to seed the IPsec lookup for
      ICMP packets.  In doing so it's processing iovec by hand and
      overcomplicating things.
      
      This patch replaces the manual iovec processing with a call to
      memcpy_fromiovecend.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
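      What a generic "copy N bytes at an offset out of an iovec array"
      helper saves the caller from writing by hand might look like the
      following userspace analogue. copy_from_iovec() is a hypothetical
      function written for this note; it is not memcpy_fromiovecend() and
      makes no claim about that function's exact semantics.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

/* Copy len bytes starting at byte offset of the scattered buffer into dst.
 * Returns 0 on success, -1 if the iovec array is too short. */
static int copy_from_iovec(unsigned char *dst, const struct iovec *iov,
			   int iovcnt, size_t offset, size_t len)
{
	int i;

	assert(dst != NULL && iov != NULL);
	for (i = 0; i < iovcnt && len > 0; i++) {
		size_t avail = iov[i].iov_len;
		size_t n;

		if (offset >= avail) {  /* segment lies before the window */
			offset -= avail;
			continue;
		}
		n = avail - offset;
		if (n > len)
			n = len;
		memcpy(dst, (const unsigned char *)iov[i].iov_base + offset, n);
		dst += n;
		len -= n;
		offset = 0;  /* later segments are read from their start */
	}
	return len == 0 ? 0 : -1;
}
```

      The helper handles the case that makes manual processing awkward:
      the two bytes of interest may be split across two iovec segments.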
    • net: Move bonding headers under include/net · 1ef8019b
      David S. Miller authored
      This way drivers like cxgb4 don't need to do ugly relative includes.
      Reported-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: Remove unnecessary struct in6_addr * casts · 4483589f
      Joe Perches authored
      Just use the address of the in6_addr.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'cxgb4-next' · c42e2533
      David S. Miller authored
      Hariprasad Shenai says:
      
      ====================
      RDMA/cxgb4,cxgb4vf,cxgb4i,csiostor: Cleanup macros
      
      This series moves the debugfs code to a new file debugfs.c and cleans up
      macros/register defines.
      
      Various patches have ended up changing the style of the symbolic
      macros/register defines, and some of them used macros/register defines
      that match the output of the script from the hardware team.
      
      As a result, the current kernel.org files are a mix of different macro styles.
      Since these macros/register defines are used by five different drivers, a
      few patch series have ended up adding duplicate macro/register define entries
      with different styles. This makes these register define/macro files a complete
      mess and we want to make them clean and consistent.
      
      We will post a few more series to cover all the macros, so that they all
      follow the same style.
      
      The patch series is created against the 'net-next' tree
      and includes patches for the cxgb4, cxgb4vf, iw_cxgb4, csiostor and cxgb4i drivers.
      
      We have included all the maintainers of respective drivers. Kindly review the
      change and let us know in case of any review comments.
      
      V3: Use suffix instead of prefix for macros/register defines
      V2: Changes the description and cover-letter content to answer David Miller's
      question
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
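      A suffix-based naming scheme for register field macros could look
      like the sketch below. The cover letter only says that suffixes are
      used instead of prefixes; the particular _S/_M/_V/_G suffixes, the
      EXAMPLE_FIELD name and the field layout are assumptions made for
      illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register field: 4 bits wide, starting at bit 8. */
#define EXAMPLE_FIELD_S    8     /* _S: shift of the field */
#define EXAMPLE_FIELD_M    0xfU  /* _M: unshifted mask of the field */
/* _V: place a value into the field position */
#define EXAMPLE_FIELD_V(x) ((uint32_t)(x) << EXAMPLE_FIELD_S)
/* _G: get the field value out of a register word */
#define EXAMPLE_FIELD_G(x) (((x) >> EXAMPLE_FIELD_S) & EXAMPLE_FIELD_M)

/* Replace the field in reg with val, leaving all other bits untouched. */
static uint32_t example_set_field(uint32_t reg, uint32_t val)
{
	assert(val <= EXAMPLE_FIELD_M);
	reg &= ~EXAMPLE_FIELD_V(EXAMPLE_FIELD_M);
	return reg | EXAMPLE_FIELD_V(val);
}
```

      Keeping the field name first and the role last means all macros for
      one field sort together, which is one plausible reason to prefer
      suffixes when several drivers share the same header.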
    • cxgb4: Cleanup macros so they follow the same style and look consistent, part 2 · e2ac9628
      Hariprasad Shenai authored
      Various patches have ended up changing the style of the symbolic
      macros/register defines to different styles.
      
      As a result, the current kernel.org files are a mix of different macro styles.
      Since these macros/register defines are used by different drivers, a
      few patch series have ended up adding duplicate macro/register define entries
      with different styles. This makes these register define/macro files a complete
      mess and we want to make them clean and consistent. This patch cleans up a part
      of it.
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: Cleanup macros so they follow the same style and look consistent · 6559a7e8
      Hariprasad Shenai authored
      Various patches have ended up changing the style of the symbolic
      macros/register defines to different styles.
      
      As a result, the current kernel.org files are a mix of different macro styles.
      Since these macros/register defines are used by different drivers, a
      few patch series have ended up adding duplicate macro/register define entries
      with different styles. This makes these register define/macro files a complete
      mess and we want to make them clean and consistent. This patch cleans up a part
      of it.
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Hariprasad Shenai's avatar
    • mlx4: use napi_complete_done() · 1a288172
      Eric Dumazet authored
      To enable gro_flush_timeout, a driver has to use napi_complete_done()
      instead of napi_complete().
      
      Tested:
       Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
      
      Without this feature, we send back about 305,000 ACKs per second.
      
      GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
      
      Setting a timer of 2000 nsec is enough to increase GRO packet sizes
      and reduce the number of ACK packets. (811/19.2 = 42)
      
      The receiver performs fewer calls to upper stacks and fewer wakeups.
      This also reduces cpu usage on the sender, as it receives fewer ACK
      packets.
      
      Note that reducing the number of wakeups increases cpu efficiency, but can
      decrease QPS, as applications won't have the chance to warm up cpu caches
      doing a partial read of RPC requests/answers if they fit in one skb.
      
      B:~# sar -n DEV 1 10 | grep eth0 | tail -1
      Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00      0.00      0.50
      
      B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
      
      B:~# sar -n DEV 1 10 | grep eth0 | tail -1
      Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00      0.00      0.50
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: gro: add a per device gro flush timer · 3b47d303
      Eric Dumazet authored
      Tuning coalescing parameters on NIC can be really hard.
      
      Servers can handle both bulk and RPC-like traffic, with conflicting
      goals: bulk flows want GRO packets as big as possible, while RPC wants
      minimal latency.
      
      To reach big GRO packets on a 10Gbe NIC, one can use:
      
      ethtool -C eth0 rx-usecs 4 rx-frames 44
      
      But this penalizes RPC sessions, increasing latencies by up to
      50% in some cases, as NICs generally do not force an interrupt when
      a packet with the TCP Push flag is received.
      
      Some NICs do not have an absolute timer, only a timer rearmed for every
      incoming packet.
      
      This patch uses a different strategy: let the GRO stack decide what to do,
      based on traffic pattern.
      
      Packets with the Push flag won't be delayed.
      Packets without the Push flag might be held in the GRO engine, if we keep
      receiving data.
      
      This new mechanism is off by default, and is enabled by setting
      /sys/class/net/ethX/gro_flush_timeout to a value in nanoseconds.
      
      To fully enable this mechanism, drivers should use napi_complete_done()
      instead of napi_complete().
      
      Tested:
       Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
      
      Without this feature, we send back about 305,000 ACKs per second.
      
      GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
      
      Setting a timer of 2000 nsec is enough to increase GRO packet sizes
      and reduce the number of ACK packets. (811/19.2 = 42)
      
      The receiver performs fewer calls to upper stacks and fewer wakeups.
      This also reduces cpu usage on the sender, as it receives fewer ACK
      packets.
      
      Note that reducing the number of wakeups increases cpu efficiency, but can
      decrease QPS, as applications won't have the chance to warm up cpu caches
      doing a partial read of RPC requests/answers if they fit in one skb.
      
      B:~# sar -n DEV 1 10 | grep eth0 | tail -1
      Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00      0.00      0.50
      
      B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
      
      B:~# sar -n DEV 1 10 | grep eth0 | tail -1
      Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00      0.00      0.50
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
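      The two ratios quoted in the test notes can be recomputed from the
      sar output, reading its second and third columns as received and
      transmitted packets per second. The helper below is a hypothetical
      convenience function; treating each transmitted packet as one ACK
      for one GRO packet follows the commit's own reading of the numbers.

```c
#include <assert.h>

/* Average number of received segments per transmitted ACK:
 * rx packets per second divided by tx packets per second. */
static double segs_per_ack(double rx_pps, double tx_pps)
{
	assert(tx_pps > 0.0);
	return rx_pps / tx_pps;
}
```

      Feeding in 811269.80 and 305732.30 gives roughly 2.65 before the
      timer is set, and 811577.30 and 19230.80 give roughly 42 afterwards.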
    • rtnetlink: add babel protocol recognition · be955b29
      Dave Taht authored
      Babel uses rt_proto 42. Add it to the userspace-visible header file.
      Signed-off-by: Dave Taht <dave.taht@bufferbloat.net>
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 09 Nov, 2014 1 commit
  4. 07 Nov, 2014 15 commits
  5. 06 Nov, 2014 1 commit