1. 08 Nov, 2018 18 commits
    • selftests: add dummy xdp test helper · bd8e1afe
      Paolo Abeni authored
      This trivial XDP program does nothing, but will be used by the
      next patch to test the GRO path in a net namespace, leveraging
      the veth XDP implementation.
      
      It's added here, despite its 'net' usage, to avoid the duplication
      of the llc-related makefile boilerplate.
      
      rfc v3 -> v1:
       - move the helper implementation into the bpf directory, don't
         touch udpgso_bench_rx
      
      rfc v2 -> rfc v3:
       - move 'x' option handling here
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: add GRO support to udp bench rx program · 0a9ac2e9
      Paolo Abeni authored
      And fix a couple of buglets (port option processing,
      clean termination on SIGINT). This is preparatory work
      for GRO tests.
      
      rfc v2 -> rfc v3:
       - use ETH_MAX_MTU
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: cope with UDP GRO packet misdirection · cf329aa4
      Paolo Abeni authored
      In some scenarios, the GRO engine can assemble a UDP GRO packet
      that ultimately lands on a non GRO-enabled socket.
      This patch tries to address the issue by explicitly checking the UDP
      socket features before enqueuing the packet and, if needed, segmenting
      the unexpected GRO packet.
      
      We must also cope with re-insertion requests: after segmentation the
      UDP code calls the helper introduced by the previous patches, as needed.
      
      Segmentation is performed by a common helper, which takes care of
      updating socket and protocol stats in case of failure.
      
      rfc v3 -> v1
       - fix compile issues with rxrpc
       - when gso_segment returns NULL, treat it as an error
       - added 'ipv4' argument to udp_rcv_segment()
      
      rfc v2 -> rfc v3
       - moved udp_rcv_segment() into net/udp.h, account errors to socket
         and ns, always return NULL or segs list
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: factor out protocol delivery helper · 80bde363
      Paolo Abeni authored
      So that we can re-use it at the UDP level in the next patch
      
      rfc v3 -> v1:
       - add the helper declaration into the ipv6 header
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: factor out protocol delivery helper · 68cb7d53
      Paolo Abeni authored
      So that we can re-use it at the UDP level in a later patch
      
      rfc v3 -> v1
       - add the helper declaration into the ip header
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: add support for UDP_GRO cmsg · bcd1665e
      Paolo Abeni authored
      When UDP GRO is enabled, the UDP_GRO cmsg will carry the ingress
      datagram size. User-space can use such info to compute the original
      packets layout.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
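The layout computation mentioned in the UDP_GRO cmsg commit can be sketched in user space as follows. This is only an illustration, not code from the patch: the helper names are hypothetical, the constant values are taken from linux/udp.h, and it assumes that the cmsg payload is a native int and that every datagram except possibly the last has the reported size.

```python
import struct
from typing import List, Optional, Sequence, Tuple

SOL_UDP = 17   # same numeric value as IPPROTO_UDP
UDP_GRO = 104  # from linux/udp.h; assumed here, not exported by Python's socket module


def gro_segment_size(ancdata: Sequence[Tuple[int, int, bytes]]) -> Optional[int]:
    """Return the datagram size carried by a UDP_GRO cmsg, or None if absent.

    `ancdata` has the shape returned by socket.recvmsg(): (level, type, data).
    Assumption: the payload is a native-endian C int.
    """
    for level, ctype, data in ancdata:
        if level == SOL_UDP and ctype == UDP_GRO and len(data) >= struct.calcsize("i"):
            return struct.unpack_from("i", data)[0]
    return None


def split_gro_payload(payload: bytes, segment_size: int) -> List[bytes]:
    """Split a coalesced buffer back into the original datagrams.

    Assumption: all datagrams but the last are exactly `segment_size` bytes.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [payload[i:i + segment_size] for i in range(0, len(payload), segment_size)]
```

A receiver would call gro_segment_size() on the ancillary data of each recvmsg() call and fall back to treating the buffer as a single datagram when it returns None.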
    • udp: implement GRO for plain UDP sockets. · e20cf8d3
      Paolo Abeni authored
      This is the RX counterpart of commit bec1f6f6 ("udp: generate gso
      with UDP_SEGMENT"). When UDP_GRO is enabled, such socket is also
      eligible for GRO in the rx path: UDP segments directed to such socket
      are assembled into a larger GSO_UDP_L4 packet.
      
      The core UDP GRO support is enabled with setsockopt(UDP_GRO).
      
      Initial benchmark numbers:
      
      Before:
      udp rx:   1079 MB/s   769065 calls/s
      
      After:
      udp rx:   1466 MB/s    24877 calls/s
      
      This change introduces a side effect with respect to UDP tunnels:
      after a UDP tunnel is created, the kernel now performs a lookup per
      ingress UDP packet, while before such lookup happened only if the
      ingress packet carried a valid internal header csum.
      
      rfc v2 -> rfc v3:
       - fixed typos in macro name and comments
       - really enforce UDP_GRO_CNT_MAX, instead of UDP_GRO_CNT_MAX + 1
       - acquire socket lock in UDP_GRO setsockopt
      
      rfc v1 -> rfc v2:
       - use a new option to enable UDP GRO
       - use static keys to protect the UDP GRO socket lookup
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
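As a rough user-space illustration of the setsockopt(UDP_GRO) step described in the commit above, the sketch below tries to enable the option on a UDP socket. The helper name is hypothetical, and the numeric value of UDP_GRO is taken from linux/udp.h because Python's socket module does not export it. On kernels or platforms without UDP GRO support the call is expected to fail, which the helper reports rather than raising.

```python
import socket

UDP_GRO = 104  # from linux/udp.h; assumed here, not exported by Python's socket module


def enable_udp_gro(sock: socket.socket) -> bool:
    """Try to enable UDP GRO on `sock`; return False if the option is rejected."""
    try:
        sock.setsockopt(socket.IPPROTO_UDP, UDP_GRO, 1)
    except OSError:
        # Older kernels and non-Linux systems do not know this option.
        return False
    return True
```

Returning a boolean lets a receiver keep working with plain per-datagram reads when GRO is unavailable.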
    • udp: implement complete book-keeping for encap_needed · 60fb9567
      Paolo Abeni authored
      The *encap_needed static keys are enabled by UDP tunnels
      and several UDP encapsulation types, but they are never
      turned off. This can cause unneeded overall performance
      degradation for systems where such features are used
      transiently.
      
      This patch introduces complete book-keeping for such keys,
      decreasing the usage at socket destruction time, if needed,
      and avoiding that the same socket could increase the key
      usage multiple times.
      
      rfc v3 -> v1:
       - add socket lock around udp_tunnel_encap_enable()
      
      rfc v2 -> rfc v3:
       - use udp_tunnel_encap_enable() in setsockopt()
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
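The book-keeping idea in the encap_needed commit can be modelled with a small reference count. This is a toy model under stated assumptions, not the kernel implementation: the class and function names are hypothetical, and a plain counter stands in for the static key.

```python
class EncapKey:
    """Stand-in for a static key: enabled while at least one user holds it."""

    def __init__(self) -> None:
        self.users = 0

    @property
    def enabled(self) -> bool:
        return self.users > 0


class Sock:
    """Stand-in for a UDP socket with a per-socket 'encap enabled' flag."""

    def __init__(self) -> None:
        self.encap_enabled = False


def encap_enable(key: EncapKey, sk: Sock) -> None:
    # The per-socket flag prevents one socket from bumping the key twice.
    if sk.encap_enabled:
        return
    sk.encap_enabled = True
    key.users += 1


def sock_destroy(key: EncapKey, sk: Sock) -> None:
    # Drop the usage at destruction time, but only if this socket took it.
    if sk.encap_enabled:
        sk.encap_enabled = False
        key.users -= 1
```

With this shape, the key turns off again once the last socket that enabled it is destroyed, which is the behaviour the commit message asks for.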
    • Merge branch 'vrf-allow-simultaneous-service-instances-in-default-and-other-VRFs' · 7e225619
      David S. Miller authored
      Mike Manning says:
      
      ====================
      vrf: allow simultaneous service instances in default and other VRFs
      
      Services currently have to be VRF-aware if they are using an unbound
      socket. One cannot have multiple service instances running in the
      default and other VRFs for services that are not VRF-aware and listen
      on an unbound socket. This is because there is no easy way of isolating
      packets received in the default VRF from those arriving in other VRFs.
      
      This series provides this isolation for stream sockets subject to the
      existing kernel parameter net.ipv4.tcp_l3mdev_accept not being set,
      given that this is documented as allowing a single service instance to
      work across all VRF domains. Similarly, net.ipv4.udp_l3mdev_accept is
      checked for datagram sockets, and net.ipv4.raw_l3mdev_accept is
      introduced for raw sockets. The functionality applies to UDP & TCP
      services as well as those using raw sockets, and is for IPv4 and IPv6.
      
      Example of running ssh instances in default and blue VRF:
      
      $ /usr/sbin/sshd -D
      $ ip vrf exec vrf-blue /usr/sbin/sshd
      $ ss -ta | egrep 'State|ssh'
      State   Recv-Q   Send-Q           Local Address:Port       Peer Address:Port
      LISTEN  0        128           0.0.0.0%vrf-blue:ssh             0.0.0.0:*
      LISTEN  0        128                    0.0.0.0:ssh             0.0.0.0:*
      ESTAB   0        0              192.168.122.220:ssh       192.168.122.1:50282
      LISTEN  0        128              [::]%vrf-blue:ssh                [::]:*
      LISTEN  0        128                       [::]:ssh                [::]:*
      ESTAB   0        0           [3000::2]%vrf-blue:ssh           [3000::9]:45896
      ESTAB   0        0                    [2000::2]:ssh           [2000::9]:46398
      
      v1:
         - Address Paolo Abeni's comments (patch 4/5)
         - Fix build when CONFIG_NET_L3_MASTER_DEV not defined (patch 1/5)
      v2:
         - Address David Ahern's comments (patches 4/5 and 5/5)
         - Remove patches 3/5 and 5/5 from series for individual submissions
         - Include a sysctl for raw sockets as recommended by David Ahern
         - Expand series into 10 patches and provide improved descriptions
      v3:
         - Update description for patch 1/10 and remove patch 6/10
      v4:
         - Set default to enabled for raw socket sysctl as recommended by David Ahern
      v5:
         - Address review comments from David Ahern in patches 2-5
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: do not drop vrf udp multicast packets · 7bd2db40
      Dewi Morgan authored
      For bound udp sockets in a vrf, also check the sdif to get the index
      for ingress devices enslaved to an l3mdev.
      Signed-off-by: Dewi Morgan <morgand@vyatta.att-mail.com>
      Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: handling of multicast packets received in VRF · 5226b6a9
      Mike Manning authored
      If skbs for multicast packets marked as enslaved to a VRF are
      received, then the secondary device index should be used to obtain
      the real device. Also verify the multicast address against the
      enslaved device rather than the l3mdev device.
      Signed-off-by: Dewi Morgan <morgand@vyatta.att-mail.com>
      Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: allow ping to link-local address in VRF · d839a0eb
      Mike Manning authored
      If link-local packets are marked as enslaved to a VRF, then to allow
      ping to the link-local from a vrf, the error handling for IPV6_PKTINFO
      needs to be relaxed to also allow the pkt ipi6_ifindex to be that of a
      slave device to the vrf.
      
      Note that the real device also needs to be retrieved in icmp6_iif()
      to set the ipv6 flow oif to this for icmp echo reply handling. The
      recent commit 24b711ed ("net/ipv6: Fix linklocal to global address
      with VRF") takes care of this, so the sdif does not need checking here.
      
      This fix makes ping to link-local consistent with that to global
      addresses, in that this can now be done from within the same VRF that
      the address is in.
      Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vrf: mark skb for multicast or link-local as enslaved to VRF · 6f12fa77
      Mike Manning authored
      The skbs for packets that are multicast or sent to a link-local address
      are not marked as being enslaved to a VRF, if they are received on a socket
      bound to the VRF. This is needed for ND and it is preferable for the
      kernel not to have to deal with the additional use-cases if ll or mcast
      packets are handled as enslaved. However, this does not allow service
      instances listening on unbound and bound to VRF sockets to distinguish
      the VRF used, if packets are sent as multicast or to a link-local
      address. The fix is for the VRF driver to also mark these skb as being
      enslaved to the VRF.
      Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix raw socket lookup device bind matching with VRFs · 7055420f
      Duncan Eastoe authored
      When there exists a pair of raw sockets, one unbound and one bound
      to a VRF but equal in all other respects, and a packet is received
      in the VRF context, __raw_v4_lookup() matches on both sockets.
      
      This results in the packet being delivered over both sockets,
      instead of only the raw socket bound to the VRF. The bound device
      checks in __raw_v4_lookup() are replaced with a call to
      raw_sk_bound_dev_eq() which correctly handles whether the packet
      should be delivered over the unbound socket in such cases.
      
      In __raw_v6_lookup() the match on the device binding of the socket is
      similarly updated to use raw_sk_bound_dev_eq() which matches the
      handling in __raw_v4_lookup().
      
      Importantly raw_sk_bound_dev_eq() takes the raw_l3mdev_accept sysctl
      into account.
      Signed-off-by: Duncan Eastoe <deastoe@vyatta.att-mail.com>
      Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: provide a sysctl raw_l3mdev_accept for raw socket lookup with VRFs · 6897445f
      Mike Manning authored
      Add a sysctl raw_l3mdev_accept to control raw socket lookup in a manner
      similar to use of tcp_l3mdev_accept for stream and of udp_l3mdev_accept
      for datagram sockets. Have this default to enabled for reasons of
      backwards compatibility. This makes it possible to specify the output
      device with cmsg and IP_PKTINFO while using a socket not bound to the
      corresponding VRF. It allows e.g. older ping implementations to be
      run with the device specified but without being executed in the VRF.
      If the option is disabled, packets received in a VRF context are only
      handled by a raw socket bound to the VRF, and correspondingly packets
      in the default VRF are only handled by a socket not bound to any VRF.
      Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
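The effect of raw_l3mdev_accept described above can be written as a small predicate. This is a simplified model built from the commit text, not the kernel's raw_sk_bound_dev_eq(): the function name is hypothetical, and it assumes that dif is the ifindex the packet is seen on (the VRF device for enslaved traffic) and that sdif is the enslaved device's ifindex, or 0 outside a VRF.

```python
def bound_dev_matches(l3mdev_accept: bool, bound_dev_if: int, dif: int, sdif: int) -> bool:
    """Decide whether a socket's device binding accepts an ingress packet.

    bound_dev_if == 0 means the socket is not bound to any device.
    """
    if bound_dev_if == 0:
        # Unbound socket: always sees default-VRF traffic (sdif == 0);
        # sees VRF traffic only when the sysctl is enabled.
        return sdif == 0 or l3mdev_accept
    # Bound socket: must match the VRF device or the enslaved device.
    return bound_dev_if in (dif, sdif)
```

For example, with a VRF device at ifindex 10 and an enslaved interface at ifindex 2, bound_dev_matches(False, 0, 10, 2) is False, while the same call with l3mdev_accept=True is True.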
    • net: ensure unbound datagram socket to be chosen when not in a VRF · 6da5b0f0
      Mike Manning authored
      Ensure an unbound datagram skt is chosen when not in a VRF. The check
      for a device match in compute_score() for UDP must be performed when
      there is no device match. For this, a failure is returned when there is
      no device match. This ensures that bound sockets are never selected,
      even if there is no unbound socket.
      
      Allow IPv6 packets to be sent over a datagram skt bound to a VRF. These
      packets are currently blocked, as flowi6_oif was set to that of the
      master vrf device, and the ipi6_ifindex is that of the slave device.
      Allow these packets to be sent by checking the device with ipi6_ifindex
      has the same L3 scope as that of the bound device of the skt, which is
      the master vrf device. Note that this check always succeeds if the skt
      is unbound.
      
      Even though the right datagram skt is now selected by compute_score(),
      a different skt is being returned that is bound to the wrong vrf. The
      difference between these and stream sockets is the handling of the skt
      option for SO_REUSEPORT. While the handling when adding a skt for reuse
      correctly checks that the bound device of the skt is a match, the skts
      in the hashslot are already incorrect. So for the same hash, a skt for
      the wrong vrf may be selected for the required port. The root cause is
      that the skt is immediately placed into a slot when it is created,
      but when the skt is then bound using SO_BINDTODEVICE, it remains in the
      same slot. The solution is to move the skt to the correct slot by
      forcing a rehash.
      Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ensure unbound stream socket to be chosen when not in a VRF · e7819058
      Mike Manning authored
      The commit a04a480d ("net: Require exact match for TCP socket
      lookups if dif is l3mdev") only ensures that the correct socket is
      selected for packets in a VRF. However, there is no guarantee that
      the unbound socket will be selected for packets when not in a VRF.
      By checking for a device match in compute_score() also for the case
      when there is no bound device and attaching a score to this, the
      unbound socket is selected. And if a failure is returned when there
      is no device match, this ensures that bound sockets are never selected,
      even if there is no unbound socket.
      Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
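The selection rule described for stream sockets can be illustrated with a toy scoring function. This is a sketch of the idea only: the score values, function names, and listener representation are assumptions made for the example and do not mirror the kernel's compute_score().

```python
from typing import Iterable, Optional, Tuple


def compute_score(bound_dev_if: int, dif: int, sdif: int, l3mdev_accept: bool = False) -> int:
    """Return -1 when the device binding does not match, else a positive score."""
    if bound_dev_if == 0:
        # Unbound listener: rejected for VRF traffic unless the sysctl allows it.
        if sdif != 0 and not l3mdev_accept:
            return -1
        return 1
    if bound_dev_if not in (dif, sdif):
        return -1
    # A matching bound listener is preferred over an unbound one.
    return 2


def select_listener(listeners: Iterable[Tuple[str, int]], dif: int, sdif: int,
                    l3mdev_accept: bool = False) -> Optional[str]:
    """Pick the best-scoring listener name, or None if none matches."""
    best, best_score = None, 0
    for name, bound_dev_if in listeners:
        score = compute_score(bound_dev_if, dif, sdif, l3mdev_accept)
        if score > best_score:
            best, best_score = name, score
    return best
```

With listeners [("default", 0), ("blue", 10)], a packet arriving outside any VRF selects "default", a packet arriving in the VRF at ifindex 10 selects "blue", and a lone bound listener is never chosen for default-VRF traffic.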
    • net: allow binding socket in a VRF when there's an unbound socket · 3c82a21f
      Robert Shearman authored
      Change the inet socket lookup to avoid packets arriving on a device
      enslaved to an l3mdev from matching unbound sockets by removing the
      wildcard for non sk_bound_dev_if and instead relying on check against
      the secondary device index, which will be 0 when the input device is
      not enslaved to an l3mdev and so match against an unbound socket and
      not match when the input device is enslaved.
      
      Change the socket binding to take the l3mdev into account to allow an
      unbound socket to not conflict with sockets bound to an l3mdev, given
      the datapath isolation now guaranteed.
      Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com>
      Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 07 Nov, 2018 22 commits