1. 24 Apr, 2016 16 commits
    • tcp-tso: do not split TSO packets at retransmit time · 10d3be56
      Eric Dumazet authored
      Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
      
      This was fine back in the days when TSO/GSO were emerging, with their
      bugs, but we believe the dark age is over.
      
      Keeping big packets in write queues, but also in stack traversal
      has a lot of benefits.
       - Less memory overhead, because write queues have less skbs
       - Less cpu overhead at ACK processing.
       - Better SACK processing, as a lot of studies mentioned how
         awful linux was at this ;)
       - Less cpu overhead to send the rtx packets
         (IP stack traversal, netfilter traversal, drivers...)
       - Better latencies in presence of losses.
       - Smaller spikes in fq like packet schedulers, as retransmits
         are not constrained by TCP Small Queues.
      
      1 % packet losses are common today, and at 100Gbit speeds, this
      translates to ~80,000 losses per second.
      Losses are often correlated, and we see many retransmit events
      leading to 1-MSS train of packets, at the time hosts are already
      under stress.
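The ~80,000 figure above can be sanity-checked with simple arithmetic. The sketch below assumes every packet is a full 1500 bytes and ignores framing overhead; both are illustrative assumptions, not numbers taken from the commit.

```python
# Rough estimate of packet losses per second on a saturated link.
# Assumption: all packets are 1500 bytes, no Ethernet framing overhead.

def losses_per_second(link_bps, loss_rate, packet_bytes=1500):
    """Return the expected number of lost packets per second."""
    packets_per_second = link_bps / (packet_bytes * 8)
    return packets_per_second * loss_rate


estimate = round(losses_per_second(100e9, 0.01))
print(estimate)  # 83333, in line with the "~80,000" order of magnitude
```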
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: fix stale links after re-enabling bearer · 8cee83dd
      Parthasarathy Bhuvaragan authored
      Commit 42b18f60 ("tipc: refactor function tipc_link_timeout()"),
      introduced a bug which prevents sending of probe messages during
      link synchronization phase. This leads to hanging links, if the
      bearer is disabled/enabled after links are up.
      
      In this commit, we send the probe messages correctly.
      
      Fixes: 42b18f60 ("tipc: refactor function tipc_link_timeout()")
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-tcstamp_ack-frag-coalesce' · 6a74c196
      David S. Miller authored
      Martin KaFai Lau says:
      
      ====================
      tcp: Handle txstamp_ack when fragmenting/coalescing skbs
      
      This patchset is to handle the txstamp-ack bit when
      fragmenting/coalescing skbs.
      
      The second patch depends on the recently posted series
      for the net branch:
      "tcp: Merge timestamp info when coalescing skbs"
      
      A BPF prog is used to kprobe to sock_queue_err_skb()
      and print out the value of serr->ee.ee_data.  The BPF
      prog (run-able from bcc) is attached here:
      
      BPF prog used for testing:
      ~~~~~
      
      from __future__ import print_function
      from bcc import BPF
      
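      bpf_text = """
      // Headers for struct pt_regs, struct sock and SKB_EXT_ERR()
      #include <uapi/linux/ptrace.h>
      #include <net/sock.h>
      #include <linux/errqueue.h>

      int trace_err_skb(struct pt_regs *ctx)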
      {
      	struct sk_buff *skb = (struct sk_buff *)ctx->si;
      	struct sock *sk = (struct sock *)ctx->di;
      	struct sock_exterr_skb *serr;
      	u32 ee_data = 0;
      
      	if (!sk || !skb)
      		return 0;
      
      	serr = SKB_EXT_ERR(skb);
      	bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
      	bpf_trace_printk("ee_data:%u\\n", ee_data);
      
      	return 0;
      };
      """
      
      b = BPF(text=bpf_text)
      b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
      print("Attached to kprobe")
      b.trace_print()
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Merge txstamp_ack in tcp_skb_collapse_tstamp · 2de8023e
      Martin KaFai Lau authored
      When collapsing skbs, txstamp_ack also needs to be merged.
      
      Retrans Collapse Test:
      ~~~~~~
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 730) = 730
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 730) = 730
      +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
      0.200 write(4, ..., 11680) = 11680
      
      0.200 > P. 1:731(730) ack 1
      0.200 > P. 731:1461(730) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:13141(4380) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 13141 win 257
      
      BPF Output Before:
      ~~~~~
      <No output due to missing SCM_TSTAMP_ACK timestamp>
      
      BPF Output After:
      ~~~~~
      <...>-2027  [007] d.s.    79.765921: : ee_data:1459
      
      Sacks Collapse Test:
      ~~~~~
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 1460) = 1460
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 13140) = 13140
      +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
      
      0.200 > P. 1:1461(1460) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:14601(5840) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 14601 win 257
      
      BPF Output Before:
      ~~~~~
      <No output due to missing SCM_TSTAMP_ACK timestamp>
      
      BPF Output After:
      ~~~~~
      <...>-2049  [007] d.s.    89.185538: : ee_data:14599
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Carry txstamp_ack in tcp_fragment_tstamp · b51e13fa
      Martin KaFai Lau authored
      When a tcp skb is sliced into two smaller skbs (e.g. in
      tcp_fragment() and tso_fragment()),  it does not carry
      the txstamp_ack bit to the newly created skb if it is needed.
      The end result is a timestamping event (SCM_TSTAMP_ACK) will
      be missing from the sk->sk_error_queue.
      
      This patch carries this bit to the new skb2
      in tcp_fragment_tstamp().
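The effect of the change can be pictured with a small model. This is a simplified Python sketch, not kernel code: the `Skb` class, its field names, and the plain integer comparison (no sequence-number wraparound) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Skb:
    seq: int                        # first byte carried by this skb
    end_seq: int                    # one past the last byte
    tskey: int = 0                  # byte the timestamp request refers to
    tstamp_requested: bool = False  # stands in for the tx timestamp flags
    txstamp_ack: bool = False       # request for an SCM_TSTAMP_ACK event


def fragment(skb, length):
    """Split skb after `length` bytes and return the new tail skb."""
    skb2 = Skb(seq=skb.seq + length, end_seq=skb.end_seq)
    skb.end_seq = skb2.seq
    # If the timestamped byte now lives in the tail, the request follows it.
    if skb.tstamp_requested and skb.tskey >= skb2.seq:
        skb2.tstamp_requested, skb.tstamp_requested = True, False
        skb.tskey, skb2.tskey = skb2.tskey, skb.tskey
        # The step described by this patch: carry the ACK bit as well.
        skb2.txstamp_ack, skb.txstamp_ack = skb.txstamp_ack, False
    return skb2


# One 14600-byte write, as in the packetdrill script, split into 2 x 7300.
head = Skb(seq=1, end_seq=14601, tskey=14600,
           tstamp_requested=True, txstamp_ack=True)
tail = fragment(head, 7300)
print(head)
print(tail)
```

In this model, without the ACK-bit line the tail would hold the timestamp key but never ask for the ACK event, which matches the "missing SCM_TSTAMP_ACK" symptom described above.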
      
      BPF Output Before:
      ~~~~~~
      <No output due to missing SCM_TSTAMP_ACK timestamp>
      
      BPF Output After:
      ~~~~~~
      <...>-2050  [000] d.s.   100.928763: : ee_data:14599
      
      Packetdrill Script:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 14600) = 14600
      +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
      
      0.200 > . 1:7301(7300) ack 1
      0.200 > P. 7301:14601(7300) ack 1
      
      0.300 < . 1:1(0) ack 14601 win 257
      
      0.300 close(4) = 0
      0.300 > F. 14601:14601(0) ack 1
      0.400 < F. 1:1(0) ack 16062 win 257
      0.400 > . 14602:14602(0) ack 2
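The numeric `setsockopt` arguments in the script are easier to follow once decoded. The sketch below assumes that option 37 at `SOL_SOCKET` is `SO_TIMESTAMPING` (true on common architectures, but architecture-dependent) and uses the `SOF_TIMESTAMPING_*` bit positions from the kernel's `net_tstamp.h` UAPI header; the `decode` helper itself is hypothetical.

```python
# Bit positions of the SOF_TIMESTAMPING_* flags (up to OPT_TSONLY).
SOF_TIMESTAMPING_FLAGS = {
    1 << 0: "SOF_TIMESTAMPING_TX_HARDWARE",
    1 << 1: "SOF_TIMESTAMPING_TX_SOFTWARE",
    1 << 2: "SOF_TIMESTAMPING_RX_HARDWARE",
    1 << 3: "SOF_TIMESTAMPING_RX_SOFTWARE",
    1 << 4: "SOF_TIMESTAMPING_SOFTWARE",
    1 << 5: "SOF_TIMESTAMPING_SYS_HARDWARE",
    1 << 6: "SOF_TIMESTAMPING_RAW_HARDWARE",
    1 << 7: "SOF_TIMESTAMPING_OPT_ID",
    1 << 8: "SOF_TIMESTAMPING_TX_SCHED",
    1 << 9: "SOF_TIMESTAMPING_TX_ACK",
    1 << 10: "SOF_TIMESTAMPING_OPT_CMSG",
    1 << 11: "SOF_TIMESTAMPING_OPT_TSONLY",
}


def decode(value):
    """Return the names of all flags set in a SO_TIMESTAMPING bitmask."""
    return [name for bit, name in SOF_TIMESTAMPING_FLAGS.items() if value & bit]


for value in (2688, 2176):
    print(value, decode(value))
```

Read this way, 2688 turns on `SOF_TIMESTAMPING_TX_ACK` (with `OPT_ID` and `OPT_TSONLY`) before the write, and 2176 clears it again afterwards, so only the data written in between requests an ACK timestamp.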
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 11afbff8
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for your net-next
      tree, mostly from Florian Westphal to sort out the lack of sufficient
      validation in x_tables and connlabel preparation patches to add
      nf_tables support. They are:
      
      1) Ensure we don't go over the ruleset blob boundaries in
         mark_source_chains().
      
      2) Validate that target jumps land on an existing xt_entry. This extra
         sanitization comes with a performance penalty when loading the ruleset.
      
      3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables.
      
      4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables.
      
      5) Enforce the minimal possible target size in x_tables.
      
      6) Similar to #3, add xt_compat_check_entry_offsets() for compat code.
      
      7) Check that standard target size is valid.
      
      8) More sanitization to ensure that the target_offset field is correct.
      
      9) Add xt_check_entry_match() to validate that matches are well-formed.
      
      10-12) Three patches to reduce the number of parameters in
          translate_compat_table() for {arp,ip,ip6}tables by using a container
          structure.
      
      13) No need to return value from xt_compat_match_from_user(), so make
          it void.
      
      14) Consolidate translate_table() so it can be used by compat code too.
      
      15) Remove obsolete check for compat code, so we keep consistent with
          what was already removed in the native layout code (back in 2007).
      
      16) Get rid of target jump validation from mark_source_chains(),
          obsoleted by #2.
      
      17) Introduce xt_copy_counters_from_user() to consolidate counter
          copying, and use it from {arp,ip,ip6}tables.
      
      18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump
          functions.
      
      19) Move nf_connlabel_match() to xt_connlabel.
      
      20) Skip event notification if connlabel did not change.
      
      21) Update of nf_connlabels_get() to make the upcoming nft connlabel
          support easier.
      
      23) Remove spinlock to read protocol state field in conntrack.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'nla_align-more' · 8d9ea160
      David S. Miller authored
      Nicolas Dichtel says:
      
      ====================
      netlink: align attributes when needed (patchset #1)
      
      This is the continuation of the work done to align netlink attributes
      when these attributes contain some 64-bit fields.
      
      David, if the third patch is too big (or maybe the series), I can split it.
      Just tell me what you prefer.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • taskstats: use the libnl API to align nlattr on 64-bit · 80df5542
      Nicolas Dichtel authored
      Goal of this patch is to use the new libnl API to align netlink attribute
      when needed.
      The layout of the netlink message will be a bit different after the patch,
      because the padattr (TASKSTATS_TYPE_STATS) will be inside the nested
      attribute instead of before it.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libnl: add nla_put_u64_64bit() helper · 73520786
      Nicolas Dichtel authored
      With this function, nla_data() is aligned on a 64-bit area.
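Why a pad attribute achieves this alignment can be shown with offsets alone. The sketch below is a toy model, not the kernel helper: it assumes a 4-byte attribute header, a message tail that is always 4-byte aligned, and offsets measured from an 8-byte-aligned buffer start. The function names are hypothetical.

```python
NLA_HDRLEN = 4  # attribute header: 2-byte length + 2-byte type


def needs_padding_for_64bit(tail_offset):
    """True if a payload placed after the next header would be misaligned."""
    # The payload starts NLA_HDRLEN bytes after the tail. With the tail on
    # an 8-byte boundary, the payload would land at offset 4 modulo 8.
    return tail_offset % 8 == 0


def payload_offset_64bit(tail_offset):
    """Offset of the 64-bit payload after optional padding."""
    if needs_padding_for_64bit(tail_offset):
        tail_offset += NLA_HDRLEN  # empty pad attribute: header only
    return tail_offset + NLA_HDRLEN


for tail in (16, 20, 24, 28):
    print(tail, payload_offset_64bit(tail), payload_offset_64bit(tail) % 8)
```

The pad attribute carries no payload, so it costs four bytes only when the tail happens to sit on an 8-byte boundary.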
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libnl: nla_put_msecs(): align on a 64-bit area · 2175d87c
      Nicolas Dichtel authored
      nla_data() is now aligned on a 64-bit area.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libnl: nla_put_s64(): align on a 64-bit area · 756a2f59
      Nicolas Dichtel authored
      nla_data() is now aligned on a 64-bit area.
      In fact, there is no user of this function.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libnl: nla_put_net64(): align on a 64-bit area · e9bbe898
      Nicolas Dichtel authored
      nla_data() is now aligned on a 64-bit area.
      
      The temporary function nla_put_be64_32bit() is removed in this patch.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libnl: nla_put_be64(): align on a 64-bit area · b46f6ded
      Nicolas Dichtel authored
      nla_data() is now aligned on a 64-bit area.
      
      A temporary version (nla_put_be64_32bit()) is added for nla_put_net64().
      This function is removed in the next patch.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libnl: nla_put_le64(): align on a 64-bit area · e7479122
      Nicolas Dichtel authored
      nla_data() is now aligned on a 64-bit area.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libnl: fix help of _64bit functions · 11a99573
      Nicolas Dichtel authored
      Fix typo and describe 'padattr'.
      
      Fixes: 089bf1a6 ("libnl: add more helpers to align attributes on 64-bit")
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 23 Apr, 2016 1 commit
  3. 21 Apr, 2016 23 commits
    • Merge tag 'rtc-4.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux · 5f44abd0
      Linus Torvalds authored
      Pull RTC fixes from Alexandre Belloni:
       "A few fixes for the RTC subsystem.  The documentation fix already
        missed 4.5 so I think it is worth taking it now:
      
        A documentation fix for s3c and two fixes for the ds1307"
      
      * tag 'rtc-4.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
        rtc: ds1307: Use irq when available for wakeup-source device
        rtc: ds1307: ds3231 temperature s16 overflow
        rtc: s3c: Document in binding that only s3c6410 needs a src clk
    • Merge tag 'pm+acpi-4.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · f78fe081
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "Two fixes for issues introduced recently, one for an intel_pstate
        driver problem uncovered by the recent switch over from using timers
        and the other one for a potential cpufreq core problem related to
        system suspend/resume.
      
        Specifics:
      
         - Fix an intel_pstate driver problem causing CPUs to get stuck in the
           highest P-state when completely idle uncovered by the recent switch
           over from using timers (Rafael Wysocki).
      
         - Avoid attempts to get the current CPU frequency when all devices
            (like I2C controllers that may be needed for that purpose) have
           been suspended during system suspend/resume (Rafael Wysocki)"
      
      * tag 'pm+acpi-4.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpufreq: Abort cpufreq_update_current_freq() for cpufreq_suspended set
        intel_pstate: Avoid getting stuck in high P-states when idle
    • rtc: ds1307: Use irq when available for wakeup-source device · 38a7a73e
      Nishanth Menon authored
      With commit 8bc2a407 ("rtc: ds1307: add support for the
      DT property 'wakeup-source'") we lost the ability for rtc irq
      functionality for devices that are actually hooked on a real IRQ
      line and have capability to wakeup as well. This is not an expected
      behavior. So, instead of just not requesting IRQ, skip the IRQ
      requirement only if interrupts are not defined for the device.
      
      Fixes: 8bc2a407 ("rtc: ds1307: add support for the DT property 'wakeup-source'")
      Reported-by: Tony Lindgren <tony@atomide.com>
      Cc: Michael Lange <linuxstuff@milaw.biz>
      Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
      Signed-off-by: Nishanth Menon <nm@ti.com>
      Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
    • rtc: ds1307: ds3231 temperature s16 overflow · 9a3dce62
      Zhuang Yuyao authored
      While retrieving the temperature from the ds3231, the result may
      overflow, since s16 is too small to hold the product of the
      multiplication by 250.

      I.e. if temp_buf[0] == 0x2d, the result (s16 temp) will be negative.
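The 0x2d case can be worked through by hand. The sketch below assumes the usual DS3231 register layout (integer degrees in the first byte, quarter degrees in the top two bits of the second) and assumes the driver shifts the 16-bit value right by 6 and multiplies by 250 to get millidegrees; the helper names are hypothetical.

```python
def to_s16(value):
    """Wrap an integer the way a signed 16-bit C variable would."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def ds3231_millicelsius(msb, lsb, store_in_s16):
    """Convert the two temperature registers to millidegrees Celsius."""
    raw = to_s16((msb << 8) | lsb) >> 6  # 10-bit value, 0.25 degC per step
    milli = raw * 250                    # 0.25 degC == 250 millidegrees
    return to_s16(milli) if store_in_s16 else milli


print(ds3231_millicelsius(0x2D, 0x00, store_in_s16=True))   # -20536
print(ds3231_millicelsius(0x2D, 0x00, store_in_s16=False))  # 45000
```

0x2d is 45 degC, i.e. 180 quarter-degree steps; 180 * 250 = 45000 exceeds the s16 maximum of 32767 and wraps to a negative value.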
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Tested-by: Michael Tatarinov <kukabu@gmail.com>
      Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · c5edde3a
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix memory leak in iwlwifi, from Matti Gottlieb.
      
       2) Add missing registration of netfilter arp_tables into initial
          namespace, from Florian Westphal.
      
       3) Fix potential NULL deref in DecNET routing code.
      
       4) Restrict NETLINK_URELEASE to truly bound sockets only, from Dmitry
          Ivanov.
      
       5) Fix dst ref counting in VRF, from David Ahern.
      
       6) Fix TSO segmenting limits in i40e driver, from Alexander Duyck.
      
       7) Fix heap leak in PACKET_DIAG_MCLIST, from Mathias Krause.
      
       8) Revalidate IPV6 datagram socket cached routes properly, particularly
          with UDP, from Martin KaFai Lau.
      
       9) Fix endian bug in RDS dp_ack_seq handling, from Qing Huang.
      
      10) Fix stats typing in bcmgenet driver, from Eric Dumazet.
      
      11) Openvswitch needs to orphan SKBs before ipv6 fragmentation handling,
          from Joe Stringer.
      
      12) SPI device reference leak in spi_ks8895 PHY driver, from Mark Brown.
      
      13) atl2 doesn't actually support scatter-gather, so don't advertise the
          feature.  From Ben Hutchings.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits)
        openvswitch: use flow protocol when recalculating ipv6 checksums
        Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets
        atl2: Disable unimplemented scatter/gather feature
        net/mlx4_en: Split SW RX dropped counter per RX ring
        net/mlx4_core: Don't allow to VF change global pause settings
        net/mlx4_core: Avoid repeated calls to pci enable/disable
        net/mlx4_core: Implement pci_resume callback
        net: phy: spi_ks8895: Don't leak references to SPI devices
        net: ethernet: davinci_emac: Fix platform_data overwrite
        net: ethernet: davinci_emac: Fix Unbalanced pm_runtime_enable
        qede: Fix single MTU sized packet from firmware GRO flow
        qede: Fix setting Skb network header
        qede: Fix various memory allocation error flows for fastpath
        tcp: Merge tx_flags and tskey in tcp_shifted_skb
        tcp: Merge tx_flags and tskey in tcp_collapse_retrans
        drivers: net: cpsw: fix wrong regs access in cpsw_ndo_open
        tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks
        openvswitch: Orphan skbs before IPv6 defrag
        Revert "Prevent NUll pointer dereference with two PHYs on cpsw"
        VSOCK: Only check error on skb_recv_datagram when skb is NULL
        ...
    • Merge branch 'geneve-vxlan-deps' · 22d37b6b
      David S. Miller authored
      Hannes Frederic Sowa says:
      
      ====================
      net: network drivers should not depend on geneve/vxlan
      
      This patchset removes the dependency of network drivers on vxlan or
      geneve, so those don't get autoloaded when the nic driver is loaded.
      
      Also audited the code such that vxlan_get_rx_port and geneve_get_rx_port
      are not called without rtnl lock.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • geneve: break dependency with netdev drivers · 681e683f
      Hannes Frederic Sowa authored
      Equivalent to "vxlan: break dependency with netdev drivers", don't
      autoload geneve module in case the driver is loaded. Instead make the
      coupling weaker by using netdevice notifiers as proxy.
      
      Cc: Jesse Gross <jesse@kernel.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: break dependency with netdev drivers · b7aade15
      Hannes Frederic Sowa authored
      Currently all drivers depend on and autoload the vxlan module because of how
      vxlan_get_rx_port is linked into them. Remove this dependency:
      
      By using a new event type in the netdevice notifier call chain we proxy
      the request from the drivers to flush and resetup the vxlan ports not
      directly via function call but by the already existing netdevice
      notifier call chain.
      
      I added a separate new event type, NETDEV_OFFLOAD_PUSH_VXLAN, to do so.
      We don't need to save those ids, as the event type field is an unsigned
      long and using specialized event types for this purpose seemed to be a
      more elegant way. This will also be beneficial if in the future we want to
      add offloading knobs for vxlan.
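The decoupling described here is essentially a publish/subscribe arrangement. The following Python sketch is a toy model of that idea only: `NETDEV_OFFLOAD_PUSH_VXLAN` is the event name from the commit, while `register_notifier`, `call_notifiers` and `vxlan_push_rx_ports` are hypothetical stand-ins, not kernel APIs.

```python
from collections import defaultdict

# Event name taken from the commit message; everything else is a toy model.
NETDEV_OFFLOAD_PUSH_VXLAN = "NETDEV_OFFLOAD_PUSH_VXLAN"

_notifier_chain = defaultdict(list)  # event -> registered callbacks


def register_notifier(event, callback):
    """Called by an optional module (e.g. vxlan) when it is loaded."""
    _notifier_chain[event].append(callback)


def call_notifiers(event, dev):
    """Called by a NIC driver; returns how many listeners saw the event."""
    for callback in _notifier_chain[event]:
        callback(dev)
    return len(_notifier_chain[event])


pushed_ports = []  # devices whose vxlan ports were pushed again


def vxlan_push_rx_ports(dev):
    pushed_ports.append(dev)


# The driver raises the event while vxlan is not loaded: nothing happens,
# and the driver needs no symbol from the vxlan module.
handled_before = call_notifiers(NETDEV_OFFLOAD_PUSH_VXLAN, "eth0")

# Once vxlan has registered, the same driver call reaches it.
register_notifier(NETDEV_OFFLOAD_PUSH_VXLAN, vxlan_push_rx_ports)
handled_after = call_notifiers(NETDEV_OFFLOAD_PUSH_VXLAN, "eth0")
print(handled_before, handled_after, pushed_ports)
```

Because the driver only names an event, loading the driver no longer pulls in the module that handles it.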
      
      Cc: Jesse Gross <jesse@kernel.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qlcnic: protect qlicnic_attach_func with rtnl_lock · 50d65d78
      Hannes Frederic Sowa authored
      qlcnic_attach_func requires rtnl_lock to be held.
      
      Cc: Dept-GELinuxNICDev@qlogic.com
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ixgbe: protect vxlan_get_rx_port in ixgbe_service_task with rtnl_lock · b1f99a78
      Hannes Frederic Sowa authored
      vxlan_get_rx_port requires rtnl_lock to be held.
      
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Cc: Shannon Nelson <shannon.nelson@intel.com>
      Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
      Cc: Don Skidmore <donald.c.skidmore@intel.com>
      Cc: Bruce Allan <bruce.w.allan@intel.com>
      Cc: John Ronciak <john.ronciak@intel.com>
      Cc: Mitch Williams <mitch.a.williams@intel.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlx4: protect mlx4_en_start_port in mlx4_en_restart with rtnl_lock · 0c5c3252
      Hannes Frederic Sowa authored
      mlx4_en_start_port requires rtnl_lock to be held.
      
      Cc: Eugenia Emantayev <eugenia@mellanox.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fm10k: protect fm10k_open in fm10k_io_resume with rtnl_lock · 41419b93
      Hannes Frederic Sowa authored
      fm10k_open requires rtnl_lock to be held.
      
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Cc: Shannon Nelson <shannon.nelson@intel.com>
      Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
      Cc: Don Skidmore <donald.c.skidmore@intel.com>
      Cc: Bruce Allan <bruce.w.allan@intel.com>
      Cc: John Ronciak <john.ronciak@intel.com>
      Cc: Mitch Williams <mitch.a.williams@intel.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • benet: be_resume needs to protect be_open with rtnl_lock · 08d9910c
      Hannes Frederic Sowa authored
      be_open calls down to functions which expect rtnl lock to be held.
      
      Cc: Sathya Perla <sathya.perla@broadcom.com>
      Cc: Ajit Khaparde <ajit.khaparde@broadcom.com>
      Cc: Padmanabh Ratnakar <padmanabh.ratnakar@broadcom.com>
      Cc: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
      Cc: Somnath Kotur <somnath.kotur@broadcom.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: use flow protocol when recalculating ipv6 checksums · b4f70527
      Simon Horman authored
      When using masked actions the ipv6_proto field of an action
      to set IPv6 fields may be zero rather than the prevailing protocol
      which will result in skipping checksum recalculation.
      
      This patch resolves the problem by relying on the protocol
      in the flow key rather than that in the set field action.
      
      Fixes: 83d2b9ba ("net: openvswitch: Support masked set actions.")
      Cc: Jarno Rajahalme <jrajahalme@nicira.com>
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets · f0d43780
      Shrikrishna Khare authored
      For IPv6, if the device indicates that the checksum is correct, set
      CHECKSUM_UNNECESSARY.
      Reported-by: Subbarao Narahari <snarahari@vmware.com>
      Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
      Signed-off-by: Jin Heo <heoj@vmware.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • atl2: Disable unimplemented scatter/gather feature · f43bfaed
      Ben Hutchings authored
      atl2 includes NETIF_F_SG in hw_features even though it has no support
      for non-linear skbs.  This bug was originally harmless since the
      driver does not claim to implement checksum offload and that used to
      be a requirement for SG.
      
      Now that SG and checksum offload are independent features, if you
      explicitly enable SG *and* use one of the rare protocols that can use
      SG without checksum offload, this potentially leaks sensitive
      information (before you notice that it just isn't working).  Therefore
      this obscure bug has been designated CVE-2016-2117.
      Reported-by: Justin Yackoski <jyackoski@crypto-nite.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Fixes: ec5f0615 ("net: Kill link between CSUM and SG features.")
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add support for IP ID mangling TSO in cases that require encapsulation · 7f348a60
      Alexander Duyck authored
      This patch adds support for NETIF_F_TSO_MANGLEID if a given tunnel supports
      NETIF_F_TSO.  This way if needed a device can then later enable the TSO
      with IP ID mangling and the tunnels on top of that device can then also
      make use of the IP ID mangling as well.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlx5-next' · 1df845be
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      Mellanox 100G mlx5 driver receive path optimizations
      
      Changes from V2:
      	- Rebased to 46e7b8d8 ("net: dsa: kill circular reference with slave priv")
      	- Updated: ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
      		* Per Eric Dumazet comment we changed the driver memory handling scheme to
      		work with order-0 pages rather than order-5 via split_page().
      		* This means that now a mlx5e rx skb can hold one or (more in case of HW LRO)
                      skb frag each pointing to a 4K order-0 page rather than one frag with order-5 page.
      	- Updated: ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
      		* Code refactoring and code reuse due the split_page() mechanism,
      		  now the MPWQE and fragmented MPWQE handling almost look the same,
      		  and share most of the code.
      	- In some cases we see 2%-3% packet rate degradation in comparison to the order-5 pages approach,
      	  due to split_page() cpu consumption, but still we do see 3%-10% improvement in comparison to the
                current linear SKB approach.
      	- We do believe that now the driver memory scheme is significantly less vulnerable
      	  to the memory DOS attack Eric pointed at.
      
      Changes from V1:
      	- Rebased to efde611b ("Merge branch 'nfp-next'")
      	- Dropped: ("net/mlx5: Refactor mlx5_core_mr to mkey")
                      Already merged into 4.6 from rdma tree.
      	- Dropped: ("net/mlx5_core: Add ConnectX-5 to list of supported devices")
                      Will be pushed to net as we want it in 4.6 release.
      	- Dropped: ("net/mlx5e: Change RX moderation period to be based on CQE")
                      Will be pushed in a later series with full software based adaptive moderation.
      	- Added: ("net/mlx5e: Delay skb->data access")
      		Small trivial optimization.
      	- Updated: ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
      	 	Changed Striding RQ defaults to:
      			> 	NUM WQEs = 16
      			> 	Strides Per WQE = 1024
      			> 	Stride Size = 128
      	- Updated: ("net/mlx5e: Use napi_alloc_skb for RX SKB allocations")
      		Consider the IP packet alignment already done in napi_alloc_skb.
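
The Striding RQ defaults listed above can be turned into buffer sizes with a little arithmetic. The following is a minimal sketch, not driver code: the helper names are hypothetical, the 4K page size is taken from the V2 notes, and the rule that a packet occupies a whole number of strides is an assumption about how Striding RQ consumes a WQE.

```c
#include <assert.h>
#include <stddef.h>

/* Defaults quoted in the V1 changelog; PAGE_SIZE_4K is an assumption. */
enum {
	NUM_WQES        = 16,
	STRIDES_PER_WQE = 1024,
	STRIDE_SIZE     = 128,	/* bytes */
	PAGE_SIZE_4K    = 4096,	/* bytes */
};

/* Bytes covered by a single multi-packet WQE. */
static size_t wqe_bytes(void)
{
	return (size_t)STRIDES_PER_WQE * STRIDE_SIZE;
}

/* Bytes covered by the whole receive queue. */
static size_t rq_bytes(void)
{
	return (size_t)NUM_WQES * wqe_bytes();
}

/* Smallest page order whose span (PAGE_SIZE_4K << order) holds `bytes`. */
static unsigned int order_for(size_t bytes)
{
	unsigned int order = 0;

	assert(bytes > 0);
	while (((size_t)PAGE_SIZE_4K << order) < bytes)
		order++;
	return order;
}

/* Strides occupied by a packet of `len` bytes, rounded up (assumed rule). */
static unsigned int strides_for_packet(size_t len)
{
	return (unsigned int)((len + STRIDE_SIZE - 1) / STRIDE_SIZE);
}
```

With these defaults one WQE spans 128 KiB, which is exactly an order-5 allocation of 4K pages, and the 16 WQEs together span 2 MiB. Under the assumed rounding rule a 64-byte packet occupies one stride and a 1500-byte packet occupies 12.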
      
      Changes from V0:
      	- Fixed a typo in commit message reported by Sergei
      	- Align SKB fragments truesize to stride size
      	- Use skb_add_rx_frag and remove the use of SKB_TRUESIZE
      	- Fix: # MTTs alignment on Power PC
      	- Fix: Free original (unaligned) pointer of MTT array
      	- Use dev_alloc_pages and dev_alloc_page
      	- Extend the stats.buff_alloc_err counter
      	- Reform the copying of packet header into skb linear data
      	- Add compiler hints for conditional statements
      	- Prefetch skb->data prior to copying packet header into it
      	- Rework: mlx5e_complete_rx_fragmented_mpwqe
      	- Handle SKB fragments before linear data
      	- Dropped ("net/mlx5e: Prefetch next RX CQE") for now
      	- Added a small patch that adds ConnectX-5 devices to the list of supported devices
      	- Rebased to 1cdba550 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next")
      
      This series includes some RX modifications and optimizations for
      the mlx5 Ethernet driver.
      
      From Rana, we have one patch that adds support for ConnectX-4
      queue counters.
      
      From Tariq, we have several patches centered on improving
      RX path message rate and CPU and memory utilization. Each patch's
      commit message gives the performance improvement numbers
      for that specific patch.
      
      In the 2nd patch we use a queue counter to report the "out of buffer"
      dropped packet count ("Dropped packets due to lack of software resources").
      
      The 3rd patch changes the driver's default RSS configuration to spread over
      the cores of the close NUMA node only, for a better out-of-the-box experience.
      
      In the 4th and 5th patches we use RX multi-packet WQE
      (Striding RQ) for better memory utilization, especially when hardware
      LRO is enabled, and for a better message rate for small packets.
      
      In the 6th and 7th patches we added a fallback mechanism to use fragmented
      memory when allocating large WQE strides fails, using UMR
      (User Memory Registration) and ICO (Internal Control Operations) SQs.
      
      In the 8th to 11th patches we made some small modifications that show
      small additional improvements.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1df845be
    • Tariq Toukan's avatar
      net/mlx5e: Add ethtool counter for RX buffer allocation failures · 54984407
      Tariq Toukan authored
      Counts the number of RX buffer allocation failures and shows it
      in ethtool statistics.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54984407
    • Saeed Mahameed's avatar
      net/mlx5e: Delay skb->data access · e20a0db3
      Saeed Mahameed authored
      Move mlx5e_handle_csum and eth_type_trans to the end of
      mlx5e_build_rx_skb to gain some more time before accessing
      skb->data, to reduce cache misses.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e20a0db3
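
The idea behind delaying the skb->data access can be sketched in plain userspace C. This is an illustration only and not the driver's mlx5e_build_rx_skb: the structures and the function name are hypothetical, and the explicit __builtin_prefetch (a GCC/Clang builtin) is an assumption added to make the ordering visible. Work that needs only the completion descriptor runs first, and the first read of packet bytes comes last.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical completion descriptor: metadata only, no packet bytes. */
struct rx_desc {
	uint32_t byte_cnt;
	uint32_t rss_hash;
	uint16_t vlan;
};

/* Hypothetical, simplified stand-in for an skb. */
struct rx_pkt {
	const uint8_t *data;
	uint32_t len;
	uint32_t hash;
	uint16_t vlan;
	uint16_t ethertype;
};

static void build_rx_pkt(struct rx_pkt *pkt, const struct rx_desc *d,
			 const uint8_t *buf)
{
	assert(d->byte_cnt >= 14);	/* room for an Ethernet header */

	/* Ask for the packet's first cache line early (assumed step). */
	__builtin_prefetch(buf);

	/* Descriptor-only work: none of this reads the packet bytes. */
	pkt->data = buf;
	pkt->len  = d->byte_cnt;
	pkt->hash = d->rss_hash;
	pkt->vlan = d->vlan;

	/* Last step: the first real read of packet memory. */
	pkt->ethertype = (uint16_t)((buf[12] << 8) | buf[13]);
}
```

The result is the same whichever order the steps run in; only the point at which packet memory is first touched changes, which is where the commit expects the saving in cache misses.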
    • Tariq Toukan's avatar
      net/mlx5e: Remove redundant barrier · 1bfec316
      Tariq Toukan authored
      The bit operation one line before already acts as an explicit
      barrier.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1bfec316
    • Tariq Toukan's avatar
      net/mlx5e: Use napi_alloc_skb for RX SKB allocations · c5adb96f
      Tariq Toukan authored
      Instead of netdev_alloc_skb, we use the napi_alloc_skb function,
      which is designed to allocate skbuffs for RX in a
      channel-specific NAPI instance and already implies the IP packet alignment.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c5adb96f
    • Tariq Toukan's avatar
      net/mlx5e: Add fragmented memory support for RX multi packet WQE · bc77b240
      Tariq Toukan authored
      If the allocation of a linear (physically contiguous) MPWQE fails,
      we allocate a fragmented MPWQE.
      
      This is implemented via the device's UMR (User Memory Registration),
      which allows registering multiple memory fragments with the ConnectX
      hardware as a contiguous buffer.
      UMR registration is an asynchronous operation and is done via
      ICO SQs.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc77b240
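
The fallback described in this commit follows a common pattern: try one large contiguous allocation and, if that fails, assemble the same capacity from page-sized fragments. Below is a minimal userspace sketch of that pattern. All names are hypothetical, malloc stands in for the kernel page allocator, the 32 x 4K geometry is assumed, and the UMR registration step that presents the fragments to the hardware as one buffer is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define FRAG_SIZE 4096u	/* assumed fragment (page) size */
#define NUM_FRAGS 32u	/* assumed fragments per buffer */

struct rx_buf {
	bool  fragmented;		/* true if built from fragments */
	void *linear;			/* set on the contiguous path */
	void *frags[NUM_FRAGS];		/* set on the fallback path */
};

/* Hook for the large allocation, so a failure can be simulated. */
typedef void *(*alloc_fn)(size_t);

/* Release whichever path was used and reset the descriptor. */
static void rx_buf_free(struct rx_buf *b)
{
	assert(b != NULL);
	free(b->linear);
	for (unsigned int i = 0; i < NUM_FRAGS; i++)
		free(b->frags[i]);
	*b = (struct rx_buf){0};
}

/* Returns 0 on success, -1 if both paths fail. */
static int rx_buf_alloc(struct rx_buf *b, alloc_fn alloc_linear)
{
	assert(b != NULL && alloc_linear != NULL);
	*b = (struct rx_buf){0};

	/* Preferred path: one contiguous region. */
	b->linear = alloc_linear((size_t)FRAG_SIZE * NUM_FRAGS);
	if (b->linear)
		return 0;

	/* Fallback path: many small, independent fragments. */
	b->fragmented = true;
	for (unsigned int i = 0; i < NUM_FRAGS; i++) {
		b->frags[i] = malloc(FRAG_SIZE);
		if (!b->frags[i]) {
			rx_buf_free(b);
			return -1;
		}
	}
	return 0;
}

/* Simulates a failed large allocation. */
static void *always_fail(size_t n)
{
	(void)n;
	return NULL;
}
```

Calling rx_buf_alloc(&b, malloc) normally takes the contiguous path, while rx_buf_alloc(&b, always_fail) exercises the fragmented path; either way the caller releases the buffer with rx_buf_free.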