1. 04 Jul, 2018 19 commits
    • Merge branch 'Handle-multiple-received-packets-at-each-stage' · 2d1b1385
      David S. Miller authored
      Edward Cree says:
      
      ====================
      Handle multiple received packets at each stage
      
      This patch series adds the capability for the network stack to receive a
       list of packets and process them as a unit, rather than handling each
       packet singly in sequence.  This is done by factoring out the existing
       datapath code at each layer and wrapping it in list handling code.
      
      The motivation for this change is twofold:
      * Instruction cache locality.  Currently, running the entire network
        stack receive path on a packet involves more code than will fit in the
        lowest-level icache, meaning that when the next packet is handled, the
        code has to be reloaded from more distant caches.  By handling packets
        in "row-major order", we ensure that the code at each layer is hot for
        most of the list.  (There is a corresponding downside in _data_ cache
        locality, since we are now touching every packet at every layer, but in
        practice there is easily enough room in dcache to hold one cacheline of
        each of the 64 packets in a NAPI poll.)
      * Reduction of indirect calls.  Owing to Spectre mitigations, indirect
        function calls are now more expensive than ever; they are also heavily
        used in the network stack's architecture (see [1]).  By replacing 64
        indirect calls to the next-layer per-packet function with a single
        indirect call to the next-layer list function, we can save CPU cycles.
      
      Drivers pass an SKB list to the stack at the end of the NAPI poll; this
       gives a natural batch size (the NAPI poll weight) and avoids waiting at
       the software level for further packets to make a larger batch (which
       would add latency).  It also means that the batch size is automatically
       tuned by the existing interrupt moderation mechanism.
      The stack then runs each layer of processing over all the packets in the
       list before proceeding to the next layer.  Where the 'next layer' (or
       the context in which it must run) differs among the packets, the stack
       splits the list; this 'late demux' means that packets which differ only
       in later headers (e.g. same L2/L3 but different L4) can traverse the
       early part of the stack together.
      Also, where the next layer is not (yet) list-aware, the stack can revert
       to calling the rest of the stack in a loop; this allows gradual/creeping
       listification, with no 'flag day' patch needed to listify everything.
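The late-demux and fallback behaviour described above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct pkt`, `struct proto_ops`, `deliver_batch()` and the other names are hypothetical and are not the kernel's data structures, and the "list" is modelled as an array of pointers so that a sublist is simply a contiguous run.

```c
#include <assert.h>
#include <stddef.h>

/* Toy packet: proto picks the next-layer handler (hypothetical field). */
struct pkt {
    int proto;
    int delivered;
};

/* Next-layer handlers; rcv_list is NULL when the layer is not list-aware. */
struct proto_ops {
    void (*rcv)(struct pkt *p);
    void (*rcv_list)(struct pkt **run, size_t n);
};

/* Counts calls made through function pointers. */
static unsigned long indirect_calls;

static void mark_one(struct pkt *p)
{
    p->delivered = 1;
}

static void mark_list(struct pkt **run, size_t n)
{
    for (size_t i = 0; i < n; i++)
        run[i]->delivered = 1;
}

/* Hand one run of same-protocol packets to the next layer. */
static void deliver_run(const struct proto_ops *ops, struct pkt **run, size_t n)
{
    if (n == 0)
        return;                     /* nothing to do for an empty sublist */
    if (ops->rcv_list) {
        indirect_calls++;           /* one indirect call for the whole run */
        ops->rcv_list(run, n);
        return;
    }
    for (size_t i = 0; i < n; i++) {
        indirect_calls++;           /* fallback: one indirect call per packet */
        ops->rcv(run[i]);
    }
}

/* Late demux: split the batch wherever the next-layer protocol changes. */
void deliver_batch(const struct proto_ops *table, struct pkt **batch, size_t n)
{
    size_t start = 0;

    for (size_t i = 1; i <= n; i++) {
        if (i == n || batch[i]->proto != batch[start]->proto) {
            deliver_run(&table[batch[start]->proto], batch + start, i - start);
            start = i;
        }
    }
}

/* Six packets with protocols 0,0,0,1,1,0; protocol 1 is not list-aware. */
unsigned long demo_late_demux(void)
{
    struct pkt pkts[6] = { {0, 0}, {0, 0}, {0, 0}, {1, 0}, {1, 0}, {0, 0} };
    struct pkt *batch[6];
    const struct proto_ops table[2] = {
        { mark_one, mark_list },
        { mark_one, NULL },
    };

    for (size_t i = 0; i < 6; i++)
        batch[i] = &pkts[i];

    indirect_calls = 0;
    deliver_batch(table, batch, 6);

    for (size_t i = 0; i < 6; i++)
        assert(pkts[i].delivered);
    return indirect_calls;
}
```

In this toy run `demo_late_demux()` makes four indirect calls instead of six: one for the first run of three packets, two for the run whose handler is not list-aware, and one for the final single-packet run.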
      
      Patches 1-2 simply place received packets on a list during the event
       processing loop on the sfc EF10 architecture, then call the normal stack
       for each packet singly at the end of the NAPI poll.  (Analogues of patch
       #2 for other NIC drivers should be fairly straightforward.)
      Patches 3-9 extend the list processing as far as the IP receive handler.
      
      Patches 1-2 alone give about a 10% improvement in packet rate in the
       baseline test; adding patches 3-9 raises this to around 25%.
      
      Performance measurements were made with NetPerf UDP_STREAM, using 1-byte
       packets and a single core to handle interrupts on the RX side; this was
       in order to measure as simply as possible the packet rate handled by a
       single core.  Figures are in Mbit/s; divide by 8 to obtain Mpps.  The
       setup was tuned for maximum reproducibility, rather than raw performance.
       Full details and more results (both with and without retpolines) from a
       previous version of the patch series are presented in [2].
      
      The baseline test uses four streams, and multiple RXQs all bound to a
       single CPU (the netperf binary is bound to a neighbouring CPU).  These
       tests were run with retpolines.
      net-next: 6.91 Mb/s (datum)
       after 9: 8.46 Mb/s (+22.5%)
      Note however that these results are not robust; changes in the parameters
       of the test sometimes shrink the gain to single-digit percentages.  For
       instance, when using only a single RXQ, only a 4% gain was seen.
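As a small worked example of the arithmetic used for these figures, the sketch below converts a Mbit/s reading to Mpps for 1-byte payloads and recomputes a relative gain. The function names are hypothetical. Percentages recomputed from the rounded figures quoted here may differ slightly from the quoted ones, presumably because of rounding.

```c
#include <assert.h>

/* Each test packet carries a 1-byte payload, i.e. 8 bits. */
#define PAYLOAD_BITS 8.0

/* Convert a payload throughput in Mbit/s to a packet rate in Mpps. */
double mbit_to_mpps(double mbit_per_s)
{
    assert(mbit_per_s >= 0.0);
    return mbit_per_s / PAYLOAD_BITS;
}

/* Relative gain of `after` over `before`, in percent. */
double gain_percent(double before, double after)
{
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}
```

For instance, `mbit_to_mpps(6.91)` is about 0.864 Mpps and `mbit_to_mpps(8.46)` is about 1.058 Mpps.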
      
      One test variation was the use of software filtering/firewall rules.
       Adding a single iptables rule (UDP port drop on a port range not matching
       the test traffic), thus making the netfilter hook have work to do,
       reduced baseline performance but showed a similar gain from the patches:
      net-next: 5.02 Mb/s (datum)
       after 9: 6.78 Mb/s (+35.1%)
      
      Similarly, testing with a set of TC flower filters (kindly supplied by
       Cong Wang) gave the following:
      net-next: 6.83 Mb/s (datum)
       after 9: 8.86 Mb/s (+29.7%)
      
      These data suggest that the batching approach remains effective in the
       presence of software switching rules, and perhaps even improves the
       performance of those rules by allowing them and their codepaths to stay
       in cache between packets.
      
      Changes from v3:
      * Fixed build error when CONFIG_NETFILTER=n (thanks kbuild).
      
      Changes from v2:
      * Used standard list handling (and skb->list) instead of the skb-queue
        functions (that use skb->next, skb->prev).
        - As part of this, changed from a "dequeue, process, enqueue" model to
          using list_for_each_safe, list_del, and (new) list_cut_before.
      * Altered __netif_receive_skb_core() changes in patch 6 as per Willem de
        Bruijn's suggestions (separate **ppt_prev from *pt_prev; renaming).
      * Removed patches to Generic XDP, since they were producing no benefit.
        I may revisit them later.
      * Removed RFC tags.
      
      Changes from v1:
      * Rebased across 2 years' net-next movement (surprisingly straightforward).
        - Added Generic XDP handling to netif_receive_skb_list_internal()
        - Dealt with changes to PFMEMALLOC setting APIs
      * General cleanup of code and comments.
      * Skipped function calls for empty lists at various points in the stack
        (patch #9).
      * Added listified Generic XDP handling (patches 10-12), though it doesn't
        seem to help (see above).
      * Extended testing to cover software firewalls / netfilter etc.
      
      [1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf
      [2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d1b1385
    • net: don't bother calling list RX functions on empty lists · b9f463d6
      Edward Cree authored
      Generally the check should be very cheap, as the sk_buff_head is in cache.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9f463d6
    • net: ipv4: listify ip_rcv_finish · 5fa12739
      Edward Cree authored
      ip_rcv_finish_core(), if it does not drop, sets skb->dst by either early
       demux or route lookup.  The last step, calling dst_input(skb), is left to
       the caller; in the listified case, we split to form sublists with a common
       dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop.
      The next step in listification would thus be to add a list_input() method
       to struct dst_entry.
      
      Early demux is an indirect call based on iph->protocol; this is another
       opportunity for listification which is not taken here (it would require
       slicing up ip_rcv_finish_core() to allow splitting on protocol changes).
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5fa12739
    • net: ipv4: listified version of ip_rcv · 17266ee9
      Edward Cree authored
      Also involved adding a way to run a netfilter hook over a list of packets.
       Rather than attempting to make netfilter know about lists (which would be
       a major project in itself) we just let it call the regular okfn (in this
       case ip_rcv_finish()) for any packets it steals, and have it give us back
       a list of packets it's synchronously accepted (which normally NF_HOOK
       would automatically call okfn() on, but we want to be able to potentially
       pass the list to a listified version of okfn().)
      The netfilter hooks themselves are indirect calls that still happen per-
       packet (see nf_hook_entry_hookfn()), but again, changing that can be left
       for future work.
      
      There is potential for out-of-order receives if the netfilter hook ends up
       synchronously stealing packets, as they will be processed before any
       accepts earlier in the list.  However, it was already possible for an
       asynchronous accept to cause out-of-order receives, so presumably this is
       considered OK.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      17266ee9
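The idea of running a per-packet hook over a list and getting back only the synchronously accepted packets can be sketched as follows. This is a user-space illustration under assumptions, not the netfilter API: the verdict names, `hook_list()` and the port-range hook are hypothetical, and the list is modelled as an array of pointers. In this sketch, dropped or stolen packets are simply not handed back; in the real stack, stolen packets would later go through the regular okfn.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdicts, loosely modelled on accept/drop/stolen. */
enum verdict { VERDICT_ACCEPT, VERDICT_DROP, VERDICT_STOLEN };

struct pkt {
    int id;
    int port;
};

typedef enum verdict (*hook_fn)(struct pkt *p);

/*
 * Run the hook on every packet (still one indirect call per packet) and
 * compact the accepted packets to the front of the batch, preserving
 * their order. Returns how many packets were accepted.
 */
size_t hook_list(hook_fn hook, struct pkt **batch, size_t n)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (hook(batch[i]) == VERDICT_ACCEPT)
            batch[kept++] = batch[i];
    }
    return kept;
}

/* Example hook: drop a port range (1000-1999, chosen arbitrarily). */
static enum verdict drop_port_range(struct pkt *p)
{
    if (p->port >= 1000 && p->port <= 1999)
        return VERDICT_DROP;
    return VERDICT_ACCEPT;
}

/* Filters five packets and writes the ids of the accepted ones to ids_out. */
size_t demo_hook_list(int ids_out[5])
{
    struct pkt pkts[5] = { {0, 53}, {1, 1500}, {2, 80}, {3, 1999}, {4, 2000} };
    struct pkt *batch[5];
    size_t kept;

    for (size_t i = 0; i < 5; i++)
        batch[i] = &pkts[i];

    kept = hook_list(drop_port_range, batch, 5);

    /* The accepted sublist could now be passed to a list-aware okfn. */
    for (size_t i = 0; i < kept; i++)
        ids_out[i] = batch[i]->id;
    return kept;
}
```

The accepted packets come back as one list, so the caller is free to pass them to a list-aware continuation in a single call rather than invoking it once per packet.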
    • net: core: propagate SKB lists through packet_type lookup · 88eb1944
      Edward Cree authored
      __netif_receive_skb_core() does a depressingly large amount of per-packet
       work that can't easily be listified, because the another_round looping
       makes it nontrivial to slice up into smaller functions.
      Fortunately, most of that work disappears in the fast path:
       * Hardware devices generally don't have an rx_handler
       * Unless you're tcpdumping or something, there is usually only one ptype
       * VLAN processing comes before the protocol ptype lookup, so doesn't force
         a pt_prev deliver
       so normally, __netif_receive_skb_core() will run straight through and pass
       back the one ptype found in ptype_base[hash of skb->protocol].
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88eb1944
    • net: core: another layer of lists, around PF_MEMALLOC skb handling · 4ce0017a
      Edward Cree authored
      First example of a layer splitting the list (rather than merely taking
       individual packets off it).
      Involves new list.h function, list_cut_before(), like list_cut_position()
       but cuts on the other side of the given entry.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ce0017a
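To illustrate what "cutting before" an entry means, here is a minimal user-space re-implementation of a circular doubly-linked list with a `list_cut_before()` operation. It is a sketch written for this example rather than a copy of the kernel's list.h, and the helper names `init_list_head()`, `list_len()` and `demo_cut()` are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly-linked list with a sentinel head. */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/*
 * Move every entry that precedes `entry` from `head` onto `list`.
 * `entry` itself stays on `head`. Any previous content of `list` is lost.
 */
static void list_cut_before(struct list_head *list,
                            struct list_head *head,
                            struct list_head *entry)
{
    if (head->next == entry) {      /* nothing precedes entry */
        init_list_head(list);
        return;
    }
    list->next = head->next;
    list->next->prev = list;
    list->prev = entry->prev;
    list->prev->next = list;
    head->next = entry;
    entry->prev = head;
}

static size_t list_len(const struct list_head *head)
{
    size_t n = 0;

    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}

/* Build a five-entry list and cut it before the third entry. */
void demo_cut(size_t *front_len, size_t *rest_len)
{
    struct list_head head, front, nodes[5];

    init_list_head(&head);
    for (size_t i = 0; i < 5; i++)
        list_add_tail(&nodes[i], &head);

    list_cut_before(&front, &head, &nodes[2]);

    assert(front.next == &nodes[0] && front.prev == &nodes[1]);
    assert(head.next == &nodes[2] && head.prev == &nodes[4]);
    *front_len = list_len(&front);
    *rest_len = list_len(&head);
}
```

After the cut, `front` holds the first two entries and `head` keeps the remaining three, starting with the entry that was passed in. A cut on the other side of the entry, as in list_cut_position(), would move that entry across as well.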
    • net: core: Another step of skb receive list processing · 7da517a3
      Edward Cree authored
      netif_receive_skb_list_internal() now processes a list and hands it
       on to the next function.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7da517a3
    • sfc: batch up RX delivery · e090bfb9
      Edward Cree authored
      Improves packet rate of 1-byte UDP receives by up to 10%.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e090bfb9
    • net: core: trivial netif_receive_skb_list() entry point · f6ad8c1b
      Edward Cree authored
      Just calls netif_receive_skb() in a loop.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f6ad8c1b
    • Merge branch 'sctp-fully-support-for-dscp-and-flowlabel-per-transport' · 2bdea157
      David S. Miller authored
      Xin Long says:
      
      ====================
      sctp: fully support for dscp and flowlabel per transport
      
      Currently dscp and flowlabel are set from the sock when sending packets,
      but since sctp is multi-homing, it should also support dscp and flowlabel
      per transport, as described in section 8.1.12 of RFC6458.
      
      v1->v2:
        - define ip_queue_xmit as inline in net/ip.h, instead of exporting
          it in Patch 1/5 according to David's suggestion.
        - fix the param len check in sctp_s/getsockopt_peer_addr_params()
          in Patch 3/5 to guarantee that an old app built with old kernel
          headers could work on the newer kernel per Marcelo's point.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2bdea157
    • sctp: check for ipv6_pinfo legal sndflow with flowlabel in sctp_v6_get_dst · 0999f021
      Xin Long authored
      A transport with an illegal flowlabel should not be allowed to send
      packets. Other transport protocols already deny this.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0999f021
    • sctp: add support for setting flowlabel when adding a transport · 4be4139f
      Xin Long authored
      Struct sockaddr_in6 has the member sin6_flowinfo, which includes the
      ipv6 flowlabel, so sctp should also support setting the flowlabel when
      adding a transport whose ipaddr comes from userspace.

      Note that addrinfo in sctp_sendmsg uses struct in6_addr for the
      secondary addrs, which doesn't contain sin6_flowinfo, so
      sin6_flowinfo needs to be copied from the primary addr.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4be4139f
    • sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams · 0b0dce7a
      Xin Long authored
      spp_ipv6_flowlabel and spp_dscp are added to sctp_paddrparams in
      this patch so that users can set the sctp_sock/asoc/transport dscp
      and flowlabel with spp_flags SPP_IPV6_FLOWLABEL or SPP_DSCP via
      SCTP_PEER_ADDR_PARAMS, as described in section 8.1.12 of RFC6458.

      As noted in the previous patch, it uses '| 0x100000' or '| 0x1' to
      mark that flowlabel or dscp is set, so that their values can be set
      to 0.

      Note that, to guarantee that an old app built with old kernel
      headers still works on a newer kernel, the param length check in
      sctp_g/setsockopt_peer_addr_params() is also improved, following
      the approach of sctp_g/setsockopt_delayed_ack() and other sockopts
      that accept two types of params.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b0dce7a
    • sctp: add support for dscp and flowlabel per transport · 8a9c58d2
      Xin Long authored
      Like some other per transport params, flowlabel and dscp are added
      in transport, asoc and sctp_sock. By default, transport sets its
      value from asoc's, and asoc does it from sctp_sock. flowlabel
      only works for ipv6 transport.
      
      Besides being passed down in sctp_xmit, they also need to be set in
      flow4/6 before the route lookup in get_dst.
      
      Note that it uses '& 0x100000' to check if flowlabel is set and
      '& 0x1' (tos 1st bit is unused) to check if dscp is set by users,
      so that they could be set to 0 by sockopt in next patch.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a9c58d2
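The "set" marker bits mentioned above can be sketched as follows. The bit values 0x100000 and 0x1 come from the commit messages; the macro and function names are hypothetical, and the value masks (a 20-bit flowlabel, and the upper six bits of a TOS-style byte for dscp) are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the two marker bit values are from the text. */
#define FLOWLABEL_SET_BIT    0x100000u  /* sits just above a 20-bit label */
#define FLOWLABEL_VALUE_MASK 0x0fffffu  /* assumption: 20-bit flowlabel */
#define DSCP_SET_BIT         0x1u       /* lowest TOS bit, unused per the text */
#define DSCP_VALUE_MASK      0xfcu      /* assumption: upper six TOS bits */

/* Store a flowlabel and mark it as explicitly set (value 0 is allowed). */
uint32_t flowlabel_store(uint32_t label)
{
    return (label & FLOWLABEL_VALUE_MASK) | FLOWLABEL_SET_BIT;
}

int flowlabel_is_set(uint32_t stored)
{
    return (stored & FLOWLABEL_SET_BIT) != 0;
}

uint32_t flowlabel_value(uint32_t stored)
{
    return stored & FLOWLABEL_VALUE_MASK;
}

/* Same idea for dscp, taking a TOS-style byte as input. */
uint8_t dscp_store(uint8_t tos)
{
    return (uint8_t)((tos & DSCP_VALUE_MASK) | DSCP_SET_BIT);
}

int dscp_is_set(uint8_t stored)
{
    return (stored & DSCP_SET_BIT) != 0;
}

uint8_t dscp_value(uint8_t stored)
{
    return (uint8_t)(stored & DSCP_VALUE_MASK);
}
```

The point of the marker bit is that a stored 0 means "never set", while `flowlabel_store(0)` yields 0x100000, meaning "explicitly set to 0", so the two cases stay distinguishable.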
    • ipv4: add __ip_queue_xmit() that supports tos param · 69b9e1e0
      Xin Long authored
      This patch introduces __ip_queue_xmit(), through which the callers
      can pass tos param into it without having to set inet->tos. For
      ipv6, ip6_xmit() already allows passing tclass parameter.
      
      It's needed when some transport protocol doesn't use inet->tos,
      like sctp's per transport dscp, which will be added in next patch.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      69b9e1e0
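The pattern described here, a core function that takes the tos explicitly plus a thin inline wrapper that passes the socket's own value, can be sketched with mock types. Everything below is hypothetical user-space code for illustration: `struct mock_sock`, `struct mock_packet`, `queue_xmit_with_tos()` and `queue_xmit()` are not the kernel's types or signatures.

```c
#include <assert.h>
#include <stdint.h>

/* Mock socket: tos stands in for the per-socket setting. */
struct mock_sock {
    uint8_t tos;
};

/* Mock packet: records the tos it was sent with. */
struct mock_packet {
    uint8_t tos;
};

/* Core transmit path: the caller chooses the tos explicitly. */
int queue_xmit_with_tos(const struct mock_sock *sk, struct mock_packet *pkt,
                        uint8_t tos)
{
    (void)sk;               /* a real path would use the socket for routing */
    pkt->tos = tos;
    return 0;
}

/* Thin wrapper: existing callers keep using the socket-wide tos. */
static inline int queue_xmit(const struct mock_sock *sk, struct mock_packet *pkt)
{
    return queue_xmit_with_tos(sk, pkt, sk->tos);
}
```

A protocol with per-transport settings would call the explicit variant with its own value, while unchanged callers go through the wrapper and see no difference in behaviour.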
    • net: dsa: Add Vitesse VSC73xx DSA router driver · 05bd97fc
      Linus Walleij authored
      This adds a DSA driver for:
      
      Vitesse VSC7385 SparX-G5 5-port Integrated Gigabit Ethernet Switch
      Vitesse VSC7388 SparX-G8 8-port Integrated Gigabit Ethernet Switch
      Vitesse VSC7395 SparX-G5e 5+1-port Integrated Gigabit Ethernet Switch
      Vitesse VSC7398 SparX-G8e 8-port Integrated Gigabit Ethernet Switch
      
      These switches have a built-in 8051 CPU and can download and execute
      firmware in this CPU. They can also be configured to use an external
      CPU handling the switch in a memory-mapped manner by connecting to
      that external CPU's memory bus.
      
      This driver (currently) only takes control of the switch chip over
      SPI and configures it to route packets around when connected to a
      CPU port. The chip has embedded PHYs and VLAN support, so we model it
      using DSA as the best fit; this lets us easily add VLAN support and
      maybe later also exploit the internal frame header to get more direct
      control over the switch.
      
      The four built-in GPIO lines are exposed using a standard GPIO chip.
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05bd97fc
    • net: phy: vitesse: Add support for VSC73xx · 975ae7c6
      Linus Walleij authored
      The VSC7385, VSC7388, VSC7395 and VSC7398 are integrated
      switch/router chips for 5+1 or 8-port switches/routers. When
      managed directly by Linux using DSA we need to do a special
      set-up "dance" on the PHY. Unfortunately these sequences
      switch the PHY to undocumented pages named 2a30 and 52b6
      and do undocumented things. The reference manual also describes
      them only as these opaque sequences. This is a best effort to
      integrate them anyway.
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      975ae7c6
    • net: dsa: Add DT bindings for Vitesse VSC73xx switches · 1decd2ec
      Linus Walleij authored
      This adds the device tree bindings for the Vitesse VSC73xx
      switches. We also add the vendor name for Vitesse.
      
      Cc: devicetree@vger.kernel.org
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1decd2ec
  2. 03 Jul, 2018 11 commits
  3. 02 Jul, 2018 10 commits