1. 07 Aug, 2017 3 commits
  2. 29 Jul, 2017 1 commit
    • Paolo Abeni's avatar
      udp6: fix socket leak on early demux · c9f2c1ae
      Paolo Abeni authored
      
      When an early demuxed packet reaches __udp6_lib_lookup_skb(), the
      sk reference is retrieved and used, but the relevant reference
      count is leaked and the socket destructor is never called.
      Beyond leaking the sk memory, if there are pending UDP packets
      in the receive queue, even the related accounted memory is leaked.
      
      In the long run, this will cause persistent forward allocation errors
      and no UDP skbs (both ipv4 and ipv6) will be able to reach the
      user-space.
      
      Fix this by explicitly accessing the early demux reference before
      the lookup, and properly decreasing the socket reference count
      after usage.
      
      Also drop the skb_steal_sock() in __udp6_lib_lookup_skb(), and
      the now obsoleted comment about "socket cache".
      
      The newly added code is derived from the current ipv4 code for the
      similar path.
      
      v1 -> v2:
        fixed the __udp6_lib_rcv() return code for resubmission,
        as suggested by Eric
      Reported-by: default avatarSam Edwards <CFSworks@gmail.com>
      Reported-by: default avatarMarc Haber <mh+netdev@zugschlus.de>
      Fixes: 5425077d
      
       ("net: ipv6: Add early demux handler for UDP unicast")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c9f2c1ae
  3. 26 Jul, 2017 1 commit
  4. 25 Jul, 2017 1 commit
    • Paolo Abeni's avatar
      udp: preserve head state for IP_CMSG_PASSSEC · dce4551c
      Paolo Abeni authored
      Paul Moore reported a SELinux/IP_PASSSEC regression
      caused by missing skb->sp at recvmsg() time. We need to
      preserve the skb head state to process the IP_CMSG_PASSSEC
      cmsg.
      
      With this commit we avoid releasing the skb head state in the
      BH even if a secpath is attached to the current skb, and stores
      the skb status (with/without head states) in the scratch area,
      so that we can access it at skb deallocation time, without
      incurring in cache-miss penalties.
      
      This also avoids misusing the skb CB for ipv6 packets,
      as introduced by the commit 0ddf3fb2
      
       ("udp: preserve
      skb->dst if required for IP options processing").
      
      Clean a bit the scratch area helpers implementation, to
      reduce the code differences between 32 and 64 bits build.
      Reported-by: default avatarPaul Moore <paul@paul-moore.com>
      Fixes: 0a463c78 ("udp: avoid a cache miss on dequeue")
      Fixes: 0ddf3fb2
      
       ("udp: preserve skb->dst if required for IP options processing")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Tested-by: default avatarPaul Moore <paul@paul-moore.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      dce4551c
  5. 18 Jul, 2017 1 commit
  6. 01 Jul, 2017 1 commit
  7. 27 Jun, 2017 1 commit
  8. 23 Jun, 2017 1 commit
    • Paolo Abeni's avatar
      udp: fix poll() · 9bd780f5
      Paolo Abeni authored
      Michael reported an UDP breakage caused by the commit b65ac446
      ("udp: try to avoid 2 cache miss on dequeue").
      The function __first_packet_length() can update the checksum bits
      of the pending skb, making the scratched area out-of-sync, and
      setting skb->csum, if the skb was previously in need of checksum
      validation.
      
      On later recvmsg() for such skb, checksum validation will be
      invoked again - due to the wrong udp_skb_csum_unnecessary()
      value - and will fail, causing the valid skb to be dropped.
      
      This change addresses the issue refreshing the scratch area in
      __first_packet_length() after the possible checksum update.
      
      Fixes: b65ac446
      
       ("udp: try to avoid 2 cache miss on dequeue")
      Reported-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9bd780f5
  9. 21 Jun, 2017 1 commit
  10. 18 Jun, 2017 1 commit
  11. 12 Jun, 2017 2 commits
    • Paolo Abeni's avatar
      udp: try to avoid 2 cache miss on dequeue · b65ac446
      Paolo Abeni authored
      
      when udp_recvmsg() is executed, on x86_64 and other archs, most skb
      fields are on cold cachelines.
      If the skb are linear and the kernel don't need to compute the udp
      csum, only a handful of skb fields are required by udp_recvmsg().
      Since we already use skb->dev_scratch to cache hot data, and
      there are 32 bits unused on 64 bit archs, use such field to cache
      as much data as we can, and try to prefetch on dequeue the relevant
      fields that are left out.
      
      This can save up to 2 cache miss per packet.
      
      v1 -> v2:
        - changed udp_dev_scratch fields types to u{32,16} variant,
          replaced bitfiled with bool
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b65ac446
    • Paolo Abeni's avatar
      udp: avoid a cache miss on dequeue · 0a463c78
      Paolo Abeni authored
      
      Since UDP no more uses sk->destructor, we can clear completely
      the skb head state before enqueuing. Amend and use
      skb_release_head_state() for that.
      
      All head states share a single cacheline, which is not
      normally used/accesses on dequeue. We can avoid entirely accessing
      such cacheline implementing and using in the UDP code a specialized
      skb free helper which ignores the skb head state.
      
      This saves a cacheline miss at skb deallocation time.
      
      v1 -> v2:
        replaced secpath_reset() with skb_release_head_state()
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0a463c78
  12. 18 May, 2017 3 commits
  13. 16 May, 2017 2 commits
  14. 07 Feb, 2017 2 commits
  15. 30 Jan, 2017 1 commit
    • Robert Shearman's avatar
      net: Avoid receiving packets with an l3mdev on unbound UDP sockets · 63a6fff3
      Robert Shearman authored
      
      Packets arriving in a VRF currently are delivered to UDP sockets that
      aren't bound to any interface. TCP defaults to not delivering packets
      arriving in a VRF to unbound sockets. IP route lookup and socket
      transmit both assume that unbound means using the default table and
      UDP applications that haven't been changed to be aware of VRFs may not
      function correctly in this case since they may not be able to handle
      overlapping IP address ranges, or be able to send packets back to the
      original sender if required.
      
      So add a sysctl, udp_l3mdev_accept, to control this behaviour with it
      being analgous to the existing tcp_l3mdev_accept, namely to allow a
      process to have a VRF-global listen socket. Have this default to off
      as this is the behaviour that users will expect, given that there is
      no explicit mechanism to set unmodified VRF-unaware application into a
      default VRF.
      Signed-off-by: default avatarRobert Shearman <rshearma@brocade.com>
      Acked-by: David Ahern <ds...
      63a6fff3
  16. 18 Jan, 2017 1 commit
  17. 07 Jan, 2017 1 commit
    • Eric Garver's avatar
      udp: inuse checks can quit early for reuseport · df560056
      Eric Garver authored
      
      UDP lib inuse checks will walk the entire hash bucket to check if the
      portaddr is in use. In the case of reuseport we can stop searching when
      we find a matching reuseport.
      
      On a 16-core VM a test program that spawns 16 threads that each bind to
      1024 sockets (one per 10ms) takes 1m45s. With this change it takes 11s.
      
      Also add a cond_resched() when the port is not specified.
      Signed-off-by: default avatarEric Garver <e@erig.me>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      df560056
  18. 24 Dec, 2016 1 commit
  19. 10 Dec, 2016 4 commits
    • Eric Dumazet's avatar
      udp: udp_rmem_release() should touch sk_rmem_alloc later · 02ab0d13
      Eric Dumazet authored
      
      In flood situations, keeping sk_rmem_alloc at a high value
      prevents producers from touching the socket.
      
      It makes sense to lower sk_rmem_alloc only at the end
      of udp_rmem_release() after the thread draining receive
      queue in udp_recvmsg() finished the writes to sk_forward_alloc.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      02ab0d13
    • Eric Dumazet's avatar
      udp: add batching to udp_rmem_release() · 6b229cf7
      Eric Dumazet authored
      
      If udp_recvmsg() constantly releases sk_rmem_alloc
      for every read packet, it gives opportunity for
      producers to immediately grab spinlocks and desperatly
      try adding another packet, causing false sharing.
      
      We can add a simple heuristic to give the signal
      by batches of ~25 % of the queue capacity.
      
      This patch considerably increases performance under
      flood by about 50 %, since the thread draining the queue
      is no longer slowed by false sharing.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6b229cf7
    • Eric Dumazet's avatar
      udp: copy skb->truesize in the first cache line · c84d9490
      Eric Dumazet authored
      
      In UDP RX handler, we currently clear skb->dev before skb
      is added to receive queue, because device pointer is no longer
      available once we exit from RCU section.
      
      Since this first cache line is always hot, lets reuse this space
      to store skb->truesize and thus avoid a cache line miss at
      udp_recvmsg()/udp_skb_destructor time while receive queue
      spinlock is held.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c84d9490
    • Eric Dumazet's avatar
      udp: add busylocks in RX path · 4b272750
      Eric Dumazet authored
      
      Idea of busylocks is to let producers grab an extra spinlock
      to relieve pressure on the receive_queue spinlock shared by consumer.
      
      This behavior is requested only once socket receive queue is above
      half occupancy.
      
      Under flood, this means that only one producer can be in line
      trying to acquire the receive_queue spinlock.
      
      These busylock can be allocated on a per cpu manner, instead of a
      per socket one (that would consume a cache line per socket)
      
      This patch considerably improves UDP behavior under stress,
      depending on number of NIC RX queues and/or RPS spread.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4b272750
  20. 08 Dec, 2016 1 commit
    • Eric Dumazet's avatar
      udp: under rx pressure, try to condense skbs · c8c8b127
      Eric Dumazet authored
      
      Under UDP flood, many softirq producers try to add packets to
      UDP receive queue, and one user thread is burning one cpu trying
      to dequeue packets as fast as possible.
      
      Two parts of the per packet cost are :
      - copying payload from kernel space to user space,
      - freeing memory pieces associated with skb.
      
      If socket is under pressure, softirq handler(s) can try to pull in
      skb->head the payload of the packet if it fits.
      
      Meaning the softirq handler(s) can free/reuse the page fragment
      immediately, instead of letting udp_recvmsg() do this hundreds of usec
      later, possibly from another node.
      
      Additional gains :
      - We reduce skb->truesize and thus can store more packets per SO_RCVBUF
      - We avoid cache line misses at copyout() time and consume_skb() time,
      and avoid one put_page() with potential alien freeing on NUMA hosts.
      
      This comes at the cost of a copy, bounded to available tail room, which
      is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
      than necessary)
      
      This patch gave me about 5 % increase in throughput in my tests.
      
      skb_condense() helper could probably used in other contexts.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c8c8b127
  21. 03 Dec, 2016 1 commit
  22. 24 Nov, 2016 1 commit
    • Eric Dumazet's avatar
      udplite: call proper backlog handlers · 30c7be26
      Eric Dumazet authored
      In commits 93821778 ("udp: Fix rcv socket locking") and
      f7ad74fe ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into
      __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite
      was forgotten.
      
      This leads to crashes if UDPlite header is pulled twice, which happens
      starting from commit e6afc8ac ("udp: remove headers from UDP packets
      before queueing")
      
      Bug found by syzkaller team, thanks a lot guys !
      
      Note that backlog use in UDP/UDPlite is scheduled to be removed starting
      from linux-4.10, so this patch is only needed up to linux-4.9
      
      Fixes: 93821778 ("udp: Fix rcv socket locking")
      Fixes: f7ad74fe ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb")
      Fixes: e6afc8ac
      
       ("udp: remove headers from UDP packets before queueing")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Herbert Xu <herbert@gondor.apana.org....
      30c7be26
  23. 21 Nov, 2016 1 commit
  24. 18 Nov, 2016 1 commit
    • Eric Dumazet's avatar
      udp: enable busy polling for all sockets · e68b6e50
      Eric Dumazet authored
      
      UDP busy polling is restricted to connected UDP sockets.
      
      This is because sk_busy_loop() only takes care of one NAPI context.
      
      There are cases where it could be extended.
      
      1) Some hosts receive traffic on a single NIC, with one RX queue.
      
      2) Some applications use SO_REUSEPORT and associated BPF filter
         to split the incoming traffic on one UDP socket per RX
      queue/thread/cpu
      
      3) Some UDP sockets are used to send/receive traffic for one flow, but
      they do not bother with connect()
      
      This patch records the napi_id of first received skb, giving more
      reach to busy polling.
      
      Tested:
      
      lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done
      
      Before patch :
         27867   28870   37324   41060   41215
         36764   36838   44455   41282   43843
      After patch :
         73920   73213   70147   74845   71697
         68315   68028   75219   70082   73707
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e68b6e50
  25. 16 Nov, 2016 1 commit
  26. 15 Nov, 2016 1 commit
  27. 09 Nov, 2016 1 commit
    • Arnd Bergmann's avatar
      udp: provide udp{4,6}_lib_lookup for nf_socket_ipv{4,6} · 30f58158
      Arnd Bergmann authored
      Since commit ca065d0c ("udp: no longer use SLAB_DESTROY_BY_RCU")
      the udp6_lib_lookup and udp4_lib_lookup functions are only
      provided when it is actually possible to call them.
      
      However, moving the callers now caused a link error:
      
      net/built-in.o: In function `nf_sk_lookup_slow_v6':
      (.text+0x131a39): undefined reference to `udp6_lib_lookup'
      net/ipv4/netfilter/nf_socket_ipv4.o: In function `nf_sk_lookup_slow_v4':
      nf_socket_ipv4.c:(.text.nf_sk_lookup_slow_v4+0x114): undefined reference to `udp4_lib_lookup'
      
      This extends the #ifdef so we also provide the functions when
      CONFIG_NF_SOCKET_IPV4 or CONFIG_NF_SOCKET_IPV6, respectively
      are set.
      
      Fixes: 8db4c5be
      
       ("netfilter: move socket lookup infrastructure to nf_socket_ipv{4,6}.c")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      30f58158
  28. 07 Nov, 2016 2 commits
    • Paolo Abeni's avatar
      udp: do fwd memory scheduling on dequeue · 7c13f97f
      Paolo Abeni authored
      A new argument is added to __skb_recv_datagram to provide
      an explicit skb destructor, invoked under the receive queue
      lock.
      The UDP protocol uses such argument to perform memory
      reclaiming on dequeue, so that the UDP protocol does not
      set anymore skb->desctructor.
      Instead explicit memory reclaiming is performed at close() time and
      when skbs are removed from the receive queue.
      The in kernel UDP protocol users now need to call a
      skb_recv_udp() variant instead of skb_recv_datagram() to
      properly perform memory accounting on dequeue.
      
      Overall, this allows acquiring only once the receive queue
      lock on dequeue.
      
      Tested using pktgen with random src port, 64 bytes packet,
      wire-speed on a 10G link as sender and udp_sink as the receiver,
      using an l4 tuple rxhash to stress the contention, and one or more
      udp_sink instances with reuseport.
      
      nr sinks	vanilla		patched
      1		440		560
      3		2150		2300
      6		3650		3800
      9		4450		4600
      12		6250		6450
      
      v1 -> v2:
       - do rmem and allocated memory s...
      7c13f97f
    • Paolo Abeni's avatar
      net/sock: add an explicit sk argument for ip_cmsg_recv_offset() · ad959036
      Paolo Abeni authored
      
      So that we can use it even after orphaining the skbuff.
      Suggested-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ad959036
  29. 04 Nov, 2016 1 commit
    • Lorenzo Colitti's avatar
      net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Lorenzo Colitti authored
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
        account.
      - For packets not associated with a userspace socket, (e.g., ping
        replies) use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespaces by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace.
      
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302
      
      Signed-off-by: default avatarLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e2d118a1