1. 22 Oct, 2015 12 commits
    • Merge branch 'bpf-perf' · 721daebb
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      bpf_perf_event_output helper
      
      Over the last year there were multiple attempts to let eBPF programs
      output data into perf events by He Kuang and Wangnan.
      The last one was:
      https://lkml.org/lkml/2015/7/20/736
      It was almost perfect, with the exception that all bpf programs would send
      data into one global perf_event.
      This patch set takes a different approach by letting user space
      open independent PERF_COUNT_SW_BPF_OUTPUT events, so that program
      output won't collide.
      
      Wangnan is working on corresponding perf patches.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples: bpf: add bpf_perf_event_output example · 39111695
      Alexei Starovoitov authored
      Performance test and example of bpf_perf_event_output().
      A kprobe is attached to sys_write() and a trivial bpf program streams
      pid+cookie into userspace via a PERF_COUNT_SW_BPF_OUTPUT event.
      
      Usage:
      $ sudo ./bld_x64/samples/bpf/trace_output
      recv 2968913 events per sec
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: introduce bpf_perf_event_output() helper · a43eec30
      Alexei Starovoitov authored
      This helper is used to send raw data from an eBPF program into a
      special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
      User space needs to perf_event_open() it (either for one or all cpus) and
      store the FD into a perf_event_array (similar to the bpf_perf_event_read()
      helper) before the eBPF program can send data into it.

      Today the programs triggered by kprobe collect the data and either store
      it into the maps or print it via bpf_trace_printk(), where the latter is a
      debug facility and not suitable for streaming the data. This new helper
      replaces such bpf_trace_printk() usage and gives programs a dedicated
      channel into user space for post-processing of the raw data collected.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • perf: pad raw data samples automatically · fa128e6a
      Alexei Starovoitov authored
      Instead of WARN_ON in perf_event_output() on unpadded raw samples,
      pad them automatically.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
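      The padding rule in the perf commit above can be illustrated with a small
      Python sketch. It assumes (this is not stated in the commit message) that a
      raw sample is written as a u32 length field followed by the payload, and
      that the whole record must end on a u64 boundary. The names
      padded_raw_size, U32_SIZE and U64_SIZE are hypothetical and exist only for
      this illustration.

```python
U32_SIZE = 4  # assumed size of the length field written before the raw payload
U64_SIZE = 8  # assumed alignment of perf records


def round_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def padded_raw_size(payload_len: int) -> int:
    """Payload size after padding, so that length field + payload is u64-aligned."""
    total = round_up(payload_len + U32_SIZE, U64_SIZE)
    return total - U32_SIZE
```

      Under these assumptions a 5-byte payload grows to 12 bytes, while payloads
      of 4 or 12 bytes are already aligned and stay unchanged.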
    • ipvlan: read direct ifindex instead of iflink · 63b11e75
      Brenden Blanco authored
      In the ipv4 outbound path of an ipvlan device in l3 mode, the ifindex is
      being grabbed from dev_get_iflink. This works for the physical device
      case, since as the documentation of that function notes: "Physical
      interfaces have the same 'ifindex' and 'iflink' values.".  However, if
      the master device is a veth, and the pairs are in separate net
      namespaces, the route lookup will fail with -ENODEV due to outer veth
      pair being in a separate namespace from the ipvlan master/routing
      namespace.
      
        ns0    |   ns1    |   ns2
       veth0a--|--veth0b--|--ipvl0
      
      In ipvlan_process_v4_outbound(), a packet sent from ipvl0 in the above
      configuration will pass fl.flowi4_oif == veth0a to
      ip_route_output_flow(), but *net == ns1.
      
      Notice also that ipv6 processing is not using iflink. Since there is a
      discrepancy in usage, fix up both the v4 and v6 cases to use the local dev
      variable.
      
      Tested this with l3 ipvlan on top of veth, as well as with single
      physical interface in the top namespace.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Reviewed-by: Jiri Benc <jbenc@redhat.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fastopen: limit max_qlen · dbf650b6
      Eric Dumazet authored
      Allowing an application to set whatever limit for
      the list of recently RST fastopen sessions [1] is not wise,
      as it opens ways to deplete kernel memory.

      Cap the user provided limit by somaxconn sysctl,
      like listen() backlog.

      [1] https://tools.ietf.org/html/rfc7413#section-5.1
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
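      The capping behaviour described in the fastopen commit above amounts to
      taking the smaller of two values. A minimal Python sketch, in which the
      function name effective_fastopen_qlen is hypothetical and the somaxconn
      value is passed in rather than read from the system:

```python
def effective_fastopen_qlen(requested: int, somaxconn: int) -> int:
    """Cap the application-requested queue length by the somaxconn limit."""
    return min(requested, somaxconn)
```

      With somaxconn at 128, a request for 1,000,000 entries yields 128, while
      a request for 16 is left as is.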
    • net: mdio-gpio: move platform data header · e2aacd96
      Vivien Didelot authored
      This header file only contains the platform data structure definition,
      so move it to the include/linux/platform_data/ directory.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ARM: gemini: remove unnecessary mdio-gpio includes · 844338e5
      Vivien Didelot authored
      Remove the inclusion of linux/mdio-gpio.h in nas4220b, wbd111 and wbd222
      boards since mdio-gpio is not used.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hisilicon: fix ptr_ret.cocci warnings · c6aa74d5
      Wu Fengguang authored
      drivers/net/ethernet/hisilicon/hns/hnae.c:442:1-3: WARNING: PTR_ERR_OR_ZERO can be used
      
       Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
      
      Generated by: scripts/coccinelle/api/ptr_ret.cocci
      
      CC: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: gro: support sit protocol · feec0cb3
      Eric Dumazet authored
      Tom Herbert added SIT support to GRO with commit
      19424e05 ("sit: Add gro callbacks to sit_offload"),
      later reverted by Herbert Xu.
      
      The problem arose because Tom's patch was building GRO
      packets without proper metadata: if packets were locally
      delivered, we would not care.

      But if packets needed to be forwarded, the GSO engine was not
      able to segment them back into individual segments.
      
      With the following patch, we correctly set skb->encapsulation
      and inner network header. We also update gso_type.
      
      Tested:
      
      Server :
      netserver
      modprobe dummy
      ifconfig dummy0 8.0.0.1 netmask 255.255.255.0 up
      arp -s 8.0.0.100 4e:32:51:04:47:e5
      iptables -I INPUT -s 10.246.7.151 -j TEE --gateway 8.0.0.100
      ifconfig sixtofour0
      sixtofour0 Link encap:IPv6-in-IPv4
                inet6 addr: 2002:af6:798::1/128 Scope:Global
                inet6 addr: 2002:af6:798::/128 Scope:Global
                UP RUNNING NOARP  MTU:1480  Metric:1
                RX packets:411169 errors:0 dropped:0 overruns:0 frame:0
                TX packets:409414 errors:0 dropped:0 overruns:0 carrier:0
                collisions:0 txqueuelen:0
                RX bytes:20319631739 (20.3 GB)  TX bytes:29529556 (29.5 MB)
      
      Client :
      netperf -H 2002:af6:798::1 -l 1000 &
      
      On the server, checked the traffic copied to dummy0 and verified that
      segments were properly rebuilt, with proper IP headers, TCP checksums...
      
      tcpdump on eth0 shows proper GRO aggregation takes place.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dummy: add more features · 8f3af277
      Eric Dumazet authored
      While testing my SIT/GRO patch using netfilter TEE module and a dummy
      device, I found some features were missing :
      
      TSO IPv6, UFO, and encapsulated traffic.
      
      ethtool -k dummy0 now gives :
      ...
      tcp-segmentation-offload: on
      	tx-tcp-segmentation: on
      	tx-tcp-ecn-segmentation: on
      	tx-tcp6-segmentation: on
      udp-fragmentation-offload: on
      ...
      tx-gre-segmentation: on
      tx-ipip-segmentation: on
      tx-sit-segmentation: on
      tx-udp_tnl-segmentation: on
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: Rightsize IFLA_AF_SPEC size calculation · b1974ed0
      Arad, Ronen authored
      if_nlmsg_size() overestimates the minimum allocation size of a netlink
      dump request (when called from rtnl_calcit()) or the size of the
      message (when called from rtnl_getlink()). This is because
      ext_filter_mask is not supported by rtnl_link_get_af_size() and
      rtnl_link_get_size().
      
      The over-estimation is significant when at least one netdev has many
      VLANs configured (8 bytes for each configured VLAN).
      
      This patch-set "rightsizes" the protocol specific attribute size
      calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
      and adding it as an argument to the get_link_af_size op in rtnl_af_ops.

      The bridge module already used filtering-aware sizing for notifications.
      br_get_link_af_size_filtered() is consistent with the modified
      get_link_af_size op, so it replaces br_get_link_af_size() in br_af_ops.
      br_get_link_af_size() becomes unused and is thus removed.
      Signed-off-by: Ronen Arad <ronen.arad@intel.com>
      Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 21 Oct, 2015 10 commits
    • Adding switchdev ageing notification on port bridged · 6ac311ae
      Elad Raz authored
      Configure the ageing time in the HW for a newly bridged device.

      CC: Scott Feldman <sfeldma@gmail.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Elad Raz <eladr@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-rack' · eb9fae32
      David S. Miller authored
      Yuchung Cheng says:
      
      ====================
      RACK loss detection
      
      RACK (Recent ACK) loss recovery uses the notion of time instead of
      packet sequence (FACK) or counts (dupthresh).
      
      It's inspired by the FACK heuristic in tcp_mark_lost_retrans(): when a
      limited transmit (new data packet) is sacked in recovery, then any
      retransmission sent before that newly sacked packet was sent must have
      been lost, since at least one round trip time has elapsed.
      
      But that existing heuristic from tcp_mark_lost_retrans()
      has several limitations:
        1) it can't detect tail drops since it depends on limited transmit
        2) it's disabled upon reordering (assumes no reordering)
        3) it's only enabled in fast recovery but not timeout recovery
      
      RACK addresses these limitations with a core idea: an unacknowledged
      packet P1 is deemed lost if a packet P2 that was sent later is
      s/acked, since at least one round trip has passed.

      Since RACK cares about the time sequence instead of the data sequence
      of packets, it can detect tail drops when a later retransmission is
      s/acked, while FACK or dupthresh can't. For reordering RACK uses a
      dynamically adjusted reordering window ("reo_wnd") to reduce false
      positives on every (small) degree of reordering, similar to the delayed
      Early Retransmit.
      
      In the current patch set RACK is only a supplemental loss detection
      and does not trigger fast recovery. However we are developing RACK
      to replace or consolidate FACK/dupthresh, early retransmit, and
      thin-dupack. These heuristics all implicitly bear the time notion.
      For example, the delayed Early Retransmit is simply applying RACK
      to trigger the fast recovery with small inflight.
      
      RACK requires measuring the minimum RTT. Tracking a global min is less
      robust due to traffic engineering pathing changes. Therefore it uses a
      windowed filter by Kathleen Nichols. The min RTT can also be useful
      for various other purposes like congestion control or stat monitoring.
      
      This patch has been used on Google servers for well over 1 year. RACK
      has also been implemented in the QUIC protocol. We are submitting an
      IETF draft as well.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: use RACK to detect losses · 4f41b1c5
      Yuchung Cheng authored
      This patch implements the second half of RACK that uses the most
      recent transmit time among all delivered packets to detect losses.

      tcp_rack_mark_lost() is called upon receiving a dubious ACK.
      It then checks if a not-yet-sacked packet was sent at least
      "reo_wnd" prior to the sent time of the most recently delivered packet.
      If so the packet is deemed lost.

      The "reo_wnd" reordering window starts with 1msec for fast loss
      detection and changes to min-RTT/4 when reordering is observed.
      We found 1msec accommodates a tiny degree of reordering well
      (<3 pkts) on faster links. We use min-RTT instead of SRTT because
      reordering is more of a path property but SRTT can be inflated by
      self-inflicted congestion. The factor of 4 is borrowed from the
      delayed early retransmit and seems to work reasonably well.
      
      Since RACK is still experimental, it is now used as a supplemental
      loss detection on top of existing algorithms. It is only effective
      after the fast recovery starts or after the timeout occurs. The
      fast recovery is still triggered by FACK and/or dupack threshold
      instead of RACK.
      
      We introduce a new sysctl net.ipv4.tcp_recovery for future
      experiments of loss recoveries. For now RACK can be disabled by
      setting it to 0.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
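      The loss-marking rule described in the commit above can be sketched in a
      few lines of Python. This is only an illustration of the rule as worded in
      the commit message, not the kernel code: the Packet class and the function
      names reo_wnd_us and rack_mark_lost are hypothetical, and times are plain
      integers in microseconds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    name: str
    sent_us: int          # transmit time in microseconds
    sacked: bool = False  # already delivered (s/acked)
    lost: bool = False    # already marked lost


def reo_wnd_us(min_rtt_us: Optional[int], reordering_seen: bool) -> int:
    """1 msec by default; min-RTT/4 once reordering has been observed."""
    if reordering_seen and min_rtt_us is not None:
        return min_rtt_us // 4
    return 1000


def rack_mark_lost(packets: List[Packet], rack_mstamp_us: int, reo_wnd: int) -> List[str]:
    """Mark packets sent at least reo_wnd before the most recently delivered one."""
    newly_lost = []
    for p in packets:
        if p.sacked or p.lost:
            continue
        if rack_mstamp_us - p.sent_us >= reo_wnd:
            p.lost = True
            newly_lost.append(p.name)
    return newly_lost
```

      For example, with the most recently delivered packet sent at 5000 usec and
      the default 1 msec window, unsacked packets sent at 0 and 500 usec are
      marked lost, while one sent at 4500 usec is still given the benefit of the
      doubt.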
    • tcp: track the packet timings in RACK · 659a8ad5
      Yuchung Cheng authored
      This patch is the first half of the RACK loss recovery.
      
      RACK loss recovery uses the notion of time instead
      of packet sequence (FACK) or counts (dupthresh). It's inspired by the
      previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
      transmit (new data packet) is sacked, then the current retransmitted
      sequence below the newly sacked sequence must have been lost,
      since at least one round trip time has elapsed.

      But it has several limitations:
      1) can't detect tail drops since it depends on limited transmit
      2) is disabled upon reordering (assumes no reordering)
      3) only enabled in fast recovery but not timeout recovery

      RACK (Recent ACK) addresses these limitations with the notion
      of time instead: a packet P1 is lost if a later packet P2 is s/acked,
      as at least one round trip has passed.

      Since RACK cares about the time sequence instead of the data sequence
      of packets, it can detect tail drops when a later retransmission is
      s/acked while FACK or dupthresh can't. For reordering RACK uses a
      dynamically adjusted reordering window ("reo_wnd") to reduce false
      positives on every (small) degree of reordering.
      
      This patch implements tcp_advanced_rack() which tracks the
      most recent transmission time among the packets that have been
      delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
      is the key to determine which packet has been lost.
      
      Consider an example that the sender sends six packets:
      T1: P1 (lost)
      T2: P2
      T3: P3
      T4: P4
      T100: sack of P2. rack.mstamp = T2
      T101: retransmit P1
      T102: sack of P2,P3,P4. rack.mstamp = T4
      T205: ACK of P4 since the hole is repaired. rack.mstamp = T101
      
      We need to be careful about spurious retransmission because it may
      falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
      to falsely mark all packets lost, just like a spurious timeout.
      
      We identify spurious retransmission by the ACK's TS echo value.
      If TS option is not applicable but the retransmission is acknowledged
      less than min-RTT ago, it is likely to be spurious. We refrain from
      using the transmission time of these spurious retransmissions.
      
      The second half is implemented in the next patch that marks packet
      lost using RACK timestamp.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
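      The timeline in the commit above can be replayed with a small Python
      sketch. It tracks only the most recent transmit time among delivered
      packets and applies the min-RTT part of the spurious-retransmission check;
      the TS echo check is left out. The function name advance_rack, its
      parameters, and the min-RTT value of 98 used in the usage note are
      assumptions made for this illustration.

```python
from typing import Iterable, Optional, Tuple


def advance_rack(
    rack_mstamp: Optional[int],
    now: int,
    delivered: Iterable[Tuple[int, bool]],
    min_rtt: Optional[int] = None,
) -> Optional[int]:
    """Return the updated RACK timestamp.

    `delivered` holds (transmit_time, was_retransmitted) for each packet
    newly ACKed or SACKed at time `now`.
    """
    for xmit_time, retransmitted in delivered:
        # A retransmission acknowledged sooner than min-RTT after it was
        # sent is likely spurious, so its transmit time is not used.
        if retransmitted and min_rtt is not None and now - xmit_time < min_rtt:
            continue
        if rack_mstamp is None or xmit_time > rack_mstamp:
            rack_mstamp = xmit_time
    return rack_mstamp
```

      Feeding it the events from the example (sack of P2 at T100, sack of
      P2,P3,P4 at T102, ACK covering the T101 retransmission at T205, with an
      assumed min-RTT of 98) moves the timestamp from T2 to T4 to T101. The same
      retransmission acknowledged at T150 would be ignored as likely spurious.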
    • tcp: skb_mstamp_after helper · 625a5e10
      Yuchung Cheng authored
      Add a helper to prepare for the first main RACK patch.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add tcp_tsopt_ecr_before helper · 77c63127
      Yuchung Cheng authored
      Add a helper to prepare for the main RACK patch.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove tcp_mark_lost_retrans() · af82f4e8
      Yuchung Cheng authored
      Remove the existing lost retransmit detection because RACK subsumes
      it completely. This also stops overloading the ack_seq field of
      the skb control block.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: track min RTT using windowed min-filter · f6722583
      Yuchung Cheng authored
      Kathleen Nichols' algorithm for tracking the minimum RTT of a
      data stream over some measurement window. It uses constant space
      and constant time per update. Yet it almost always delivers
      the same minimum as an implementation that has to keep all
      the data in the window. The measurement window is tunable via
      sysctl net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

      The algorithm keeps track of the best, 2nd best & 3rd best min
      values, maintaining an invariant that the measurement time of
      the n'th best >= n-1'th best. It also makes sure that the three
      values are widely separated in the time window since that bounds
      the worst-case error when the data is monotonically increasing
      over the window.

      Upon getting a new min, we can forget everything earlier because
      it has no value - the new min is less than everything else in the
      window by definition and it's the most recent. So we restart fresh
      on every new min and overwrite the 2nd & 3rd choices. The same
      property holds for the 2nd & 3rd best.

      Therefore we have to maintain two invariants to maximize the
      information in the samples, one on values (1st.v <= 2nd.v <=
      3rd.v) and the other on times (now-win <= 1st.t <= 2nd.t <= 3rd.t <=
      now). These invariants determine the structure of the code.
      
      The RTT input to the windowed filter is the minimum RTT measured
      from ACK or SACK, or as the last resort from TCP timestamps.
      
      The accessor tcp_min_rtt() returns the minimum RTT seen in the
      window. ~0U indicates it is not available. The minimum is 1usec
      even if the true RTT is below that.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
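      A three-candidate windowed minimum of the kind described in the commit
      above might look roughly like the Python sketch below. It is written from
      the description in the commit message and is not the kernel
      implementation. The class name WindowedMin is hypothetical, and the
      quarter-window and half-window refresh rules are an assumption about how
      the "widely separated" property is maintained.

```python
import math


class WindowedMin:
    """Track the minimum over a sliding time window in constant space."""

    def __init__(self, window: float):
        self.window = window
        # Three (value, time) candidates: best, 2nd best, 3rd best.
        self.m = [(math.inf, 0)] * 3

    def update(self, value: float, now: float) -> float:
        sample = (value, now)
        m = self.m

        # A new (or equal) minimum makes everything older worthless.
        if value <= m[0][0]:
            m[0] = m[1] = m[2] = sample
        elif value <= m[1][0]:
            m[1] = m[2] = sample
        elif value <= m[2][0]:
            m[2] = sample

        elapsed = now - m[0][1]
        if elapsed > self.window:
            # The best candidate expired: promote the 2nd and 3rd choices.
            m[0], m[1], m[2] = m[1], m[2], sample
            if now - m[0][1] > self.window:
                m[0], m[1] = m[1], sample
                if now - m[0][1] > self.window:
                    m[0] = sample
        elif m[1][1] == m[0][1] and elapsed > self.window / 4:
            # A quarter window passed without a distinct 2nd choice.
            m[1] = m[2] = sample
        elif m[2][1] == m[1][1] and elapsed > self.window / 2:
            # Half a window passed without a distinct 3rd choice.
            m[2] = sample
        return m[0][0]
```

      With a window of 100 time units, feeding the samples (50 at t=0), (30 at
      t=10), (40 at t=20), (40 at t=40) and (45 at t=120) reports minimums of
      50, 30, 30, 30 and then 40 once the value 30 has aged out of the window.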
    • tcp: apply Kern's check on RTTs used for congestion control · 9e45a3e3
      Yuchung Cheng authored
      Currently ca_seq_rtt_us does not use Kern's check. Fix that by
      checking if any packet acked is a retransmit, for both RTT used
      for RTT estimation and congestion control.

      Fixes: 5b08e47c ("tcp: prefer packet timing to TS-ECR for RTT")
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
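      The check in the commit above (commonly known as Karn's algorithm) can be
      sketched in Python as follows. This is a simplified illustration, not the
      kernel logic: the function name rtt_sample_us is hypothetical, and
      measuring from the earliest acked packet is an assumption made to keep the
      example short.

```python
from typing import List, Optional, Tuple


def rtt_sample_us(acked: List[Tuple[int, bool]], ack_time_us: int) -> Optional[int]:
    """Return an RTT sample, or None if it would be ambiguous.

    `acked` holds (sent_time_us, was_retransmitted) for each packet covered
    by the ACK. If any of them was retransmitted, the ACK cannot be matched
    to a single transmission, so no sample is taken.
    """
    if not acked or any(retransmitted for _, retransmitted in acked):
        return None
    earliest_sent = min(sent for sent, _ in acked)
    return ack_time_us - earliest_sent
```

      An ACK at 300 usec covering packets sent at 100 and 120 usec gives a
      200 usec sample. If either packet had been retransmitted, no sample is
      produced.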
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · c8fdc324
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2015-10-19
      
      This series contains updates to i40e and i40evf only.
      
      Kiran adds a spinlock around code accessing VSI MAC filter list to
      ensure that we are synchronizing access to the filter list, otherwise
      we can end up with multiple accesses at the same time which can cause
      the VSI MAC filter list to get in an unstable or corrupted state.
      
      Jesse fixes overlong BIT defines, where the RSS enabling call was
      mistakenly missed.  Also fixes a bug where the enable function was
      enabling the interrupt twice while trying to update the two interrupt
      throttle rate thresholds for Rx and Tx, while refactoring the IRQ
      enable function to simplify reading the flow.  Addressed the high
      CPU utilization of some small streaming workloads, for which the
      driver should use less CPU.
      
      Anjali fixes two X722 issues with respect to EEPROM checksum verify and
      reading NVM version info.  Fixed where a mask value was accidentally
      replaced with a bit mask causing Flow Director sideband to be broken.
      
      Alex Duyck fixes areas of the drivers which run from hard interrupt
      context or with interrupts already disabled in netpoll, so that they use
      napi_schedule_irqoff() instead of napi_schedule().
      
      Mitch fixes the VF drivers to not easily give up when it is not able
      to communicate with the PF driver.
      
      Carolyn fixes a problem where our tool's MAC loopback test would fail
      after driver unbind because the hardware was configured for multiqueue and
      the unbind operation did not clear this configuration.  Also fixed an issue
      where the NVMUpdate tool gets bad data from the PHY when using the PHY
      NVM feature because of contention on the MDIO interface from getting
      PHY capability calls from the driver during regular operations.
      
      Catherine fixed an issue where we were checking if autoneg was allowed
      to change before checking if autoneg was changing, these checks need to
      be in the reverse order.
      
      Jean Sacren fixes up a function header comment to align the kernel-docs
      with the actual code.
      
      v2: Cleaned up the use of spin_is_locked() in patch 1 based on feedback
          from David Miller, since it always evaluates to zero on uni-processor
          builds
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 20 Oct, 2015 1 commit
  4. 19 Oct, 2015 17 commits