1. 26 Oct, 2016 14 commits
  2. 23 Oct, 2016 11 commits
  3. 22 Oct, 2016 7 commits
    • David S. Miller's avatar
      Merge branch 'bpf-numa-id' · 67dc1596
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      Add BPF numa id helper
      
      This patch set adds a helper for retrieving current numa node
      id and a test case for SO_REUSEPORT.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      67dc1596
    • Daniel Borkmann's avatar
      reuseport, bpf: add test case for bpf_get_numa_node_id · 3c2c3c16
      Daniel Borkmann authored
      The test case is very similar to reuseport_bpf_cpu, only that here
      we select socket members based on current numa node id.
      
        # numactl -H
        available: 2 nodes (0-1)
        node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
        node 0 size: 128867 MB
        node 0 free: 120080 MB
        node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
        node 1 size: 96765 MB
        node 1 free: 87504 MB
        node distances:
        node   0   1
          0:  10  20
          1:  20  10
      
        # ./reuseport_bpf_numa
        ---- IPv4 UDP ----
        send node 0, receive socket 0
        send node 1, receive socket 1
        send node 1, receive socket 1
        send node 0, receive socket 0
        ---- IPv6 UDP ----
        send node 0, receive socket 0
        send node 1, receive socket 1
        send node 1, receive socket 1
        send node 0, receive socket 0
        ---- IPv4 TCP ----
        send node 0, receive socket 0
        send node 1, receive socket 1
        send node 1, receive socket 1
        send node 0, receive socket 0
        ---- IPv6 TCP ----
        send node 0, receive socket 0
        send node 1, receive socket 1
        send node 1, receive socket 1
        send node 0, receive socket 0
        SUCCESS
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3c2c3c16
    • Daniel Borkmann's avatar
      bpf: add helper for retrieving current numa node id · 2d0e30c3
      Daniel Borkmann authored
      Use case is mainly for soreuseport to select sockets for the local
      numa node, but since generic, lets also add this for other networking
      and tracing program types.
      Suggested-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2d0e30c3
    • David S. Miller's avatar
      Merge branch 'udpmem' · a10b91b8
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      udp: refactor memory accounting
      
      This patch series refactor the udp memory accounting, replacing the
      generic implementation with a custom one, in order to remove the needs for
      locking the socket on the enqueue and dequeue operations. The socket backlog
      usage is dropped, as well.
      
      The first patch factor out pieces of some queue and memory management
      socket helpers, so that they can later be used by the udp memory accounting
      functions.
      The second patch adds the memory account helpers, without using them.
      The third patch replacse the old rx memory accounting path for udp over ipv4 and
      udp over ipv6. In kernel UDP users are updated, as well.
      
      The memory accounting schema is described in detail in the individual patch
      commit message.
      
      The performance gain depends on the specific scenario; with few flows (and
      little contention in the original code) the differences are in the noise range,
      while with several flows contending the same socket, the measured speed-up
      is relevant (e.g. even over 100% in case of extreme contention)
      
      Many thanks to Eric Dumazet for the reiterated reviews and suggestions.
      
      v5 -> v6:
       - do not orphan the skb on enqueue, skb_steal_sock() already did
         the work for us
      
      v4 -> v5:
       - use the receive queue spin lock to protect the memory accounting
       - several minor clean-up
      
      v3 -> v4:
       - simplified the locking schema, always use a plain spinlock
      
      v2 -> v3:
       - do not set the now unsed backlog_rcv callback
      
      v1 -> v2:
       - changed slighly the memory accounting schema, we now perform lazy reclaim
       - fixed forward_alloc updating issue
       - fixed memory counter integer overflows
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a10b91b8
    • Paolo Abeni's avatar
      udp: use it's own memory accounting schema · 850cbadd
      Paolo Abeni authored
      Completely avoid default sock memory accounting and replace it
      with udp-specific accounting.
      
      Since the new memory accounting model encapsulates completely
      the required locking, remove the socket lock on both enqueue and
      dequeue, and avoid using the backlog on enqueue.
      
      Be sure to clean-up rx queue memory on socket destruction, using
      udp its own sk_destruct.
      
      Tested using pktgen with random src port, 64 bytes packet,
      wire-speed on a 10G link as sender and udp_sink as the receiver,
      using an l4 tuple rxhash to stress the contention, and one or more
      udp_sink instances with reuseport.
      
      nr readers      Kpps (vanilla)  Kpps (patched)
      1               170             440
      3               1250            2150
      6               3000            3650
      9               4200            4450
      12              5700            6250
      
      v4 -> v5:
        - avoid unneeded test in first_packet_length
      
      v3 -> v4:
        - remove useless sk_rcvqueues_full() call
      
      v2 -> v3:
        - do not set the now unsed backlog_rcv callback
      
      v1 -> v2:
        - add memory pressure support
        - fixed dropwatch accounting for ipv6
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      850cbadd
    • Paolo Abeni's avatar
      udp: implement memory accounting helpers · f970bd9e
      Paolo Abeni authored
      Avoid using the generic helpers.
      Use the receive queue spin lock to protect the memory
      accounting operation, both on enqueue and on dequeue.
      
      On dequeue perform partial memory reclaiming, trying to
      leave a quantum of forward allocated memory.
      
      On enqueue use a custom helper, to allow some optimizations:
      - use a plain spin_lock() variant instead of the slightly
        costly spin_lock_irqsave(),
      - avoid dst_force check, since the calling code has already
        dropped the skb dst
      - avoid orphaning the skb, since skb_steal_sock() already did
        the work for us
      
      The above needs custom memory reclaiming on shutdown, provided
      by the udp_destruct_sock().
      
      v5 -> v6:
        - don't orphan the skb on enqueue
      
      v4 -> v5:
        - replace the mem_lock with the receive queue spin lock
        - ensure that the bh is always allowed to enqueue at least
          a skb, even if sk_rcvbuf is exceeded
      
      v3 -> v4:
        - reworked memory accunting, simplifying the schema
        - provide an helper for both memory scheduling and enqueuing
      
      v1 -> v2:
        - use a udp specific destrctor to perform memory reclaiming
        - remove a couple of helpers, unneeded after the above cleanup
        - do not reclaim memory on dequeue if not under memory
          pressure
        - reworked the fwd accounting schema to avoid potential
          integer overflow
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f970bd9e
    • Paolo Abeni's avatar
      net/socket: factor out helpers for memory and queue manipulation · f8c3bf00
      Paolo Abeni authored
      Basic sock operations that udp code can use with its own
      memory accounting schema. No functional change is introduced
      in the existing APIs.
      
      v4 -> v5:
        - avoid whitespace changes
      
      v2 -> v4:
        - avoid exporting __sock_enqueue_skb
      
      v1 -> v2:
        - avoid export sock_rmem_free
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f8c3bf00
  4. 21 Oct, 2016 2 commits
    • Jarod Wilson's avatar
      net: remove MTU limits on a few ether_setup callers · 8b1efc0f
      Jarod Wilson authored
      These few drivers call ether_setup(), but have no ndo_change_mtu, and thus
      were overlooked for changes to MTU range checking behavior. They
      previously had no range checks, so for feature-parity, set their min_mtu
      to 0 and max_mtu to ETH_MAX_MTU (65535), instead of the 68 and 1500
      inherited from the ether_setup() changes. Fine-tuning can come after we get
      back to full feature-parity here.
      
      CC: netdev@vger.kernel.org
      Reported-by: default avatarAsbjoern Sloth Toennesen <asbjorn@asbjorn.st>
      CC: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
      CC: R Parameswaran <parameswaran.r7@gmail.com>
      Signed-off-by: default avatarJarod Wilson <jarod@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8b1efc0f
    • Vitaly Kuznetsov's avatar
      hv_netvsc: fix a race between netvsc_send() and netvsc_init_buf() · e8f0a89c
      Vitaly Kuznetsov authored
      Fix in commit 88098834 ("hv_netvsc: set nvdev link after populating
      chn_table") turns out to be incomplete. A crash in
      netvsc_get_next_send_section() is observed on mtu change when the device
      is under load. The race I identified is: if we get to netvsc_send() after
      we set net_device_ctx->nvdev link in netvsc_device_add() but before we
      finish netvsc_connect_vsp()->netvsc_init_buf() send_section_map is not
      allocated and we crash. Unfortunately we can't set net_device_ctx->nvdev
      link after the netvsc_init_buf() call as during the negotiation we need
      to receive packets and on the receive path we check for it. It would
      probably be possible to split nvdev into a pair of nvdev_in and nvdev_out
      links and check them accordingly in get_outbound_net_device()/
      get_inbound_net_device() but this looks like an overkill.
      
      Check that send_section_map is allocated in netvsc_send().
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e8f0a89c
  5. 20 Oct, 2016 6 commits
    • David S. Miller's avatar
      Merge branch 'MTU-core-range-checking-more' · 9e58c5dc
      David S. Miller authored
      Jarod Wilson says:
      
      ====================
      net: use core MTU range checking everywhere
      
      This stack of patches should get absolutely everything in the kernel
      converted from doing their own MTU range checking to the core MTU range
      checking. This second spin includes alterations to hopefully fix all
      concerns raised with the first, as well as including some additional
      changes to drivers and infrastructure where I completely missed necessary
      updates.
      
      These have all been built through the 0-day build infrastructure via the
      (rebasing) master branch at https://github.com/jarodwilson/linux-muck, which
      at the time of the most recent compile across 147 configs, was based on
      net-next at commit 7b1536ef.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9e58c5dc
    • Jarod Wilson's avatar
      ipv4/6: use core net MTU range checking · b96f9afe
      Jarod Wilson authored
      ipv4/ip_tunnel:
      - min_mtu = 68, max_mtu = 0xFFF8 - dev->hard_header_len - t_hlen
      - preserve all ndo_change_mtu checks for now to prevent regressions
      
      ipv6/ip6_tunnel:
      - min_mtu = 68, max_mtu = 0xFFF8 - dev->hard_header_len
      - preserve all ndo_change_mtu checks for now to prevent regressions
      
      ipv6/ip6_vti:
      - min_mtu = 1280, max_mtu = 65535
      - remove redundant vti6_change_mtu
      
      ipv6/sit:
      - min_mtu = 1280, max_mtu = 0xFFF8 - t_hlen
      - remove redundant ipip6_tunnel_change_mtu
      
      CC: netdev@vger.kernel.org
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarJarod Wilson <jarod@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b96f9afe
    • Jarod Wilson's avatar
      s390/net: use net core MTU range checking · 46b3ef4c
      Jarod Wilson authored
      ctcm:
      - min_mtu = 576, max_mtu = 65527
      
      netiucv:
      - min_mtu = 576, max_mtu = 65535
      
      qeth:
      - min_mtu = 64, max_mtu = 65535
      
      CC: netdev@vger.kernel.org
      CC: linux-s390@vger.kernel.org
      CC: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: default avatarJarod Wilson <jarod@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      46b3ef4c
    • Jarod Wilson's avatar
      net: use core MTU range checking in misc drivers · b3e3893e
      Jarod Wilson authored
      firewire-net:
      - set min/max_mtu
      - remove fwnet_change_mtu
      
      nes:
      - set max_mtu
      - clean up nes_netdev_change_mtu
      
      xpnet:
      - set min/max_mtu
      - remove xpnet_dev_change_mtu
      
      hippi:
      - set min/max_mtu
      - remove hippi_change_mtu
      
      batman-adv:
      - set max_mtu
      - remove batadv_interface_change_mtu
      - initialization is a little async, not 100% certain that max_mtu is set
        in the optimal place, don't have hardware to test with
      
      rionet:
      - set min/max_mtu
      - remove rionet_change_mtu
      
      slip:
      - set min/max_mtu
      - streamline sl_change_mtu
      
      um/net_kern:
      - remove pointless ndo_change_mtu
      
      hsi/clients/ssi_protocol:
      - use core MTU range checking
      - remove now redundant ssip_pn_set_mtu
      
      ipoib:
      - set a default max MTU value
      - Note: ipoib's actual max MTU can vary, depending on if the device is in
        connected mode or not, so we'll just set the max_mtu value to the max
        possible, and let the ndo_change_mtu function continue to validate any new
        MTU change requests with checks for CM or not. Note that ipoib has no
        min_mtu set, and thus, the network core's mtu > 0 check is the only lower
        bounds here.
      
      mptlan:
      - use net core MTU range checking
      - remove now redundant mpt_lan_change_mtu
      
      fddi:
      - min_mtu = 21, max_mtu = 4470
      - remove now redundant fddi_change_mtu (including export)
      
      fjes:
      - min_mtu = 8192, max_mtu = 65536
      - The max_mtu value is actually one over IP_MAX_MTU here, but the idea is to
        get past the core net MTU range checks so fjes_change_mtu can validate a
        new MTU against what it supports (see fjes_support_mtu in fjes_hw.c)
      
      hsr:
      - min_mtu = 0 (calls ether_setup, max_mtu is 1500)
      
      f_phonet:
      - min_mtu = 6, max_mtu = 65541
      
      u_ether:
      - min_mtu = 14, max_mtu = 15412
      
      phonet/pep-gprs:
      - min_mtu = 576, max_mtu = 65530
      - remove redundant gprs_set_mtu
      
      CC: netdev@vger.kernel.org
      CC: linux-rdma@vger.kernel.org
      CC: Stefan Richter <stefanr@s5r6.in-berlin.de>
      CC: Faisal Latif <faisal.latif@intel.com>
      CC: linux-rdma@vger.kernel.org
      CC: Cliff Whickman <cpw@sgi.com>
      CC: Robin Holt <robinmholt@gmail.com>
      CC: Jes Sorensen <jes@trained-monkey.org>
      CC: Marek Lindner <mareklindner@neomailbox.ch>
      CC: Simon Wunderlich <sw@simonwunderlich.de>
      CC: Antonio Quartulli <a@unstable.cc>
      CC: Sathya Prakash <sathya.prakash@broadcom.com>
      CC: Chaitra P B <chaitra.basappa@broadcom.com>
      CC: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
      CC: MPT-FusionLinux.pdl@broadcom.com
      CC: Sebastian Reichel <sre@kernel.org>
      CC: Felipe Balbi <balbi@kernel.org>
      CC: Arvid Brodin <arvid.brodin@alten.se>
      CC: Remi Denis-Courmont <courmisch@gmail.com>
      Signed-off-by: default avatarJarod Wilson <jarod@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b3e3893e
    • Jarod Wilson's avatar
      net: use core MTU range checking in virt drivers · d0c2c997
      Jarod Wilson authored
      hyperv_net:
      - set min/max_mtu, per Haiyang, after rndis_filter_device_add
      
      virtio_net:
      - set min/max_mtu
      - remove virtnet_change_mtu
      
      vmxnet3:
      - set min/max_mtu
      
      xen-netback:
      - min_mtu = 0, max_mtu = 65517
      
      xen-netfront:
      - min_mtu = 0, max_mtu = 65535
      
      unisys/visor:
      - clean up defines a little to not clash with network core or add
        redundat definitions
      
      CC: netdev@vger.kernel.org
      CC: virtualization@lists.linux-foundation.org
      CC: "K. Y. Srinivasan" <kys@microsoft.com>
      CC: Haiyang Zhang <haiyangz@microsoft.com>
      CC: "Michael S. Tsirkin" <mst@redhat.com>
      CC: Shrikrishna Khare <skhare@vmware.com>
      CC: "VMware, Inc." <pv-drivers@vmware.com>
      CC: Wei Liu <wei.liu2@citrix.com>
      CC: Paul Durrant <paul.durrant@citrix.com>
      CC: David Kershner <david.kershner@unisys.com>
      Signed-off-by: default avatarJarod Wilson <jarod@redhat.com>
      Reviewed-by: default avatarHaiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d0c2c997
    • Jarod Wilson's avatar
      net: use core MTU range checking in core net infra · 91572088
      Jarod Wilson authored
      geneve:
      - Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
      - This one isn't quite as straight-forward as others, could use some
        closer inspection and testing
      
      macvlan:
      - set min/max_mtu
      
      tun:
      - set min/max_mtu, remove tun_net_change_mtu
      
      vxlan:
      - Merge __vxlan_change_mtu back into vxlan_change_mtu
      - Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      - This one is also not as straight-forward and could use closer inspection
        and testing from vxlan folks
      
      bridge:
      - set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      
      openvswitch:
      - set min/max_mtu, remove internal_dev_change_mtu
      - note: max_mtu wasn't checked previously, it's been set to 65535, which
        is the largest possible size supported
      
      sch_teql:
      - set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)
      
      macsec:
      - min_mtu = 0, max_mtu = 65535
      
      macvlan:
      - min_mtu = 0, max_mtu = 65535
      
      ntb_netdev:
      - min_mtu = 0, max_mtu = 65535
      
      veth:
      - min_mtu = 68, max_mtu = 65535
      
      8021q:
      - min_mtu = 0, max_mtu = 65535
      
      CC: netdev@vger.kernel.org
      CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      CC: Tom Herbert <tom@herbertland.com>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Paolo Abeni <pabeni@redhat.com>
      CC: Jiri Benc <jbenc@redhat.com>
      CC: WANG Cong <xiyou.wangcong@gmail.com>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Pravin B Shelar <pshelar@ovn.org>
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: Pravin Shelar <pshelar@nicira.com>
      CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
      Signed-off-by: default avatarJarod Wilson <jarod@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      91572088