1. 03 May, 2018 23 commits
    • bpf, arm64: remove ld_abs/ld_ind · 816d9ef3
      Daniel Borkmann authored
      Since LD_ABS/LD_IND instructions are now removed from the core and
      reimplemented through a combination of inlined BPF instructions and
      a slow-path helper, we can get rid of the complexity from arm64 JIT.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, x64: remove ld_abs/ld_ind · e782bdcf
      Daniel Borkmann authored
      Since LD_ABS/LD_IND instructions are now removed from the core and
      reimplemented through a combination of inlined BPF instructions and
      a slow-path helper, we can get rid of the complexity from x64 JIT.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add skb_load_bytes_relative helper · 4e1ec56c
      Daniel Borkmann authored
      This adds a small BPF helper similar to bpf_skb_load_bytes() that
      is able to load relative to mac/net header offset from the skb's
      linear data. Compared to bpf_skb_load_bytes(), it takes a fifth
      argument namely start_header, which is either BPF_HDR_START_MAC
      or BPF_HDR_START_NET. This allows for a more flexible alternative
      compared to LD_ABS/LD_IND with negative offset. It's enabled for
      tc BPF programs as well as sock filter program types where it's
      mainly useful in reuseport programs to ease access to lower header
      data.
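
      As a rough user-space illustration only (not the kernel code), the
      idea of loading bytes relative to a chosen header can be modelled as
      below. struct toy_pkt, enum hdr_start and toy_load_bytes_relative()
      are hypothetical names made up for this sketch; the real helper
      operates on an skb and takes BPF_HDR_START_MAC or BPF_HDR_START_NET.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for BPF_HDR_START_MAC / BPF_HDR_START_NET. */
enum hdr_start { HDR_START_MAC, HDR_START_NET };

/* Toy packet: one linear buffer plus the recorded header offsets. */
struct toy_pkt {
	const uint8_t *buf; /* start of the linear data */
	size_t len;         /* number of valid bytes in buf */
	size_t mac_off;     /* offset of the link-layer header */
	size_t net_off;     /* offset of the network header */
};

/* Copy len bytes found offset bytes past the selected header into to.
 * Returns 0 on success, -1 if the requested range leaves the buffer. */
static int toy_load_bytes_relative(const struct toy_pkt *p, size_t offset,
				   void *to, size_t len, enum hdr_start start)
{
	size_t base;

	assert(p && to);
	switch (start) {
	case HDR_START_MAC:
		base = p->mac_off;
		break;
	case HDR_START_NET:
		base = p->net_off;
		break;
	default:
		return -1;
	}
	/* Written as subtractions so the bounds check cannot overflow. */
	if (base > p->len || offset > p->len - base ||
	    len > p->len - base - offset)
		return -1;
	memcpy(to, p->buf + base + offset, len);
	return 0;
}
```

      With mac_off = 0 and net_off = 14, a load at offset 0 from
      HDR_START_NET reads the first bytes after a 14 byte Ethernet header,
      while the same offset from HDR_START_MAC reads the start of the frame.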
      
      Reference: https://lists.iovisor.org/pipermail/iovisor-dev/2017-March/000698.html
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: implement ld_abs/ld_ind in native bpf · e0cea7ce
      Daniel Borkmann authored
      The main part of this work is to finally allow removal of LD_ABS
      and LD_IND from the BPF core by reimplementing them through native
      eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
      keeping them around in native eBPF caused far more trouble than
      they were worth. To list just some of the security issues from
      the past:
      
        * fdfaf64e ("x86: bpf_jit: support negative offsets")
        * 35607b02 ("sparc: bpf_jit: fix loads from negative offsets")
        * e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
        * 07aee943 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
        * 6d59b7db ("bpf, s390x: do not reload skb pointers in non-skb context")
        * 87338c8e ("bpf, ppc64: do not reload skb pointers in non-skb context")
      
      For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
      these days due to their limitations and more efficient/flexible
      alternatives that have been developed over time such as direct
      packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
      register, the load happens in host endianness and its exception
      handling can yield unexpected behavior. The latter is explained
      in depth in f6b1b3bf ("bpf: fix subprog verifier bypass by
      div/mod by 0 exception") with similar cases of exceptions we had.
      In native eBPF more recent program types will disable LD_ABS/LD_IND
      altogether through may_access_skb() in verifier, and given the
      limitations in terms of exception handling, it's also disabled
      in programs that use BPF to BPF calls.
      
      In terms of cBPF, LD_ABS/LD_IND are used in networking programs
      to access packet data. They are not used in seccomp-BPF, but in
      programs that use cBPF for socket filtering or for reuseport
      demuxing. This is mostly relevant for applications that have not
      yet migrated to native eBPF.
      
      The main complexity and source of bugs in LD_ABS/LD_IND comes
      from their implementation in the various JITs. Most of them keep
      the model from cBPF times by implementing a fast path written
      in asm. They typically use two CPU registers, hidden from the BPF
      program, for caching the skb's headlen (skb->len - skb->data_len)
      and skb->data. Throughout the JIT phase this requires keeping track
      of whether LD_ABS/LD_IND are used and, if so, the two registers need
      to be recached each time a BPF helper would change the underlying
      packet data in the native eBPF case. At least in the eBPF case,
      available CPU registers are scarce, and the additional exit path out
      of the asm-written JIT helper also makes it inflexible, since not all
      parts of the JITer are under control of plain C. A LD_ABS/LD_IND
      implementation in eBPF therefore allows significantly reducing
      the complexity in JITs with comparable performance results for
      them, e.g.:
      
      test_bpf             tcpdump port 22             tcpdump complex
      x64      - before    15 21 10                    14 19  18
               - after      7 10 10                     7 10  15
      arm64    - before    40 91 92                    40 91 151
               - after     51 64 73                    51 62 113
      
      For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
      and cache the skb's headlen and data in the cBPF prologue. The
      BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
      used as a local temporary variable. This allows to shrink the
      image on x86_64 also for seccomp programs slightly since mapping
      to %rsi is not an ereg. In callee-saved R8 and R9 we now track
      skb data and headlen, respectively. For normal prologue emission
      in the JITs this does not add any extra instructions since R8, R9
      are pushed to stack in any case from eBPF side. cBPF uses the
      convert_bpf_ld_abs() emitter which probes the fast path inline
      already and falls back to bpf_skb_load_helper_{8,16,32}() helper
      relying on the cached skb data and headlen as well. R8 and R9
      never need to be reloaded due to bpf_helper_changes_pkt_data()
      since all skb access in cBPF is read-only. Then, for the case
      of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
      the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
      and neither caches skb data and headlen nor has an inlined fast
      path. The reason for the latter is that native eBPF does not have
      any extra registers available anyway, but even if there were, it
      avoids any reload of skb data and headlen in the first place.
      Additionally, for negative offsets, we provide an alternative
      bpf_skb_load_bytes_relative() helper in eBPF which operates
      similarly to bpf_skb_load_bytes() and allows for more flexibility.
      Tested myself on x64, arm64, s390x, from Sandipan on ppc64.
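
      For illustration only, the cBPF scheme described above (probe the
      cached headlen inline, fall back to a helper otherwise) can be
      modelled in user space roughly as follows. struct toy_skb,
      toy_slow_load8() and toy_ld_abs16() are hypothetical names for this
      sketch and are not the emitted BPF instructions or the kernel helpers;
      the non-linear part of the packet is reduced to a single extra buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy skb: a linear area plus one extra buffer standing in for the
 * non-linear part of the packet. */
struct toy_skb {
	const uint8_t *data;  /* cached pointer to the linear data */
	size_t headlen;       /* cached length of the linear data */
	const uint8_t *frag;  /* bytes that live outside the linear area */
	size_t frag_len;
};

static int slow_path_calls; /* counts fallbacks, for demonstration only */

/* Slow path: fetch one byte from wherever it lives. */
static int toy_slow_load8(const struct toy_skb *skb, size_t off, uint8_t *out)
{
	slow_path_calls++;
	if (off < skb->headlen) {
		*out = skb->data[off];
		return 0;
	}
	off -= skb->headlen;
	if (off < skb->frag_len) {
		*out = skb->frag[off];
		return 0;
	}
	return -1;
}

/* Load a 16 bit big-endian value at absolute offset off. */
static int toy_ld_abs16(const struct toy_skb *skb, size_t off, uint16_t *out)
{
	uint8_t hi, lo;

	assert(skb && out);
	/* Fast path: both bytes sit inside the cached linear area. */
	if (off <= skb->headlen && skb->headlen - off >= 2) {
		*out = (uint16_t)(skb->data[off] << 8 | skb->data[off + 1]);
		return 0;
	}
	/* Otherwise fall back to the slow-path helper. */
	if (toy_slow_load8(skb, off, &hi) || toy_slow_load8(skb, off + 1, &lo))
		return -1;
	*out = (uint16_t)(hi << 8 | lo);
	return 0;
}
```

      A load that straddles the end of the linear area, or lies beyond it,
      takes the slow path; everything else is served from the cached
      pointer and length without a call.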
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier · 93731ef0
      Daniel Borkmann authored
      Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. Reason
      is that the eBPF tests from test_bpf module do not go via BPF verifier
      and therefore any instruction rewrites from verifier cannot take place.
      
      Therefore, move them into test_verifier which runs out of user space,
      so that the verifier can rewrite LD_ABS/LD_IND internally in upcoming
      patches. It will have the same effect since runtime tests are also
      performed from there. This also allows us to finally unexport
      bpf_skb_vlan_{push,pop}_proto and keep it internal to the core kernel.
      
      Additionally, also add further cBPF LD_ABS/LD_IND test coverage into
      test_bpf.ko suite.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: prefix cbpf internal helpers with bpf_ · b390134c
      Daniel Borkmann authored
      No change in functionality, just remove the '__' prefix and replace it
      with a 'bpf_' prefix instead. We later on add a couple of more helpers
      for cBPF and keeping the scheme with '__' is suboptimal there.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'AF_XDP-initial-support' · 08dbc7a6
      Alexei Starovoitov authored
      Björn Töpel says:
      
      ====================
      This patch set introduces a new address family called AF_XDP that is
      optimized for high performance packet processing and, in upcoming
      patch sets, zero-copy semantics. In this patch set, we have removed
      all zero-copy related code in order to make it smaller, simpler and
      hopefully more review friendly. This patch set only supports copy-mode
      for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
      for RX using the XDP_DRV path. Zero-copy support requires XDP and
      driver changes that Jesper Dangaard Brouer is working on. Some of his
      work has already been accepted. We will publish our zero-copy support
      for RX and TX on top of his patch sets at a later point in time.
      
      An AF_XDP socket (XSK) is created with the normal socket()
      syscall. Associated with each XSK are two queues: the RX queue and the
      TX queue. A socket can receive packets on the RX queue and it can send
      packets on the TX queue. These queues are registered and sized with
      the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
      mandatory to have at least one of these queues for each socket. In
      contrast to AF_PACKET V2/V3 these descriptor queues are separated from
      packet buffers. An RX or TX descriptor points to a data buffer in a
      memory area called a UMEM. RX and TX can share the same UMEM so that a
      packet does not have to be copied between RX and TX. Moreover, if a
      packet needs to be kept for a while due to a possible retransmit, the
      descriptor that points to that packet can be changed to point to
      another and reused right away. This again avoids copying data.
      
      This new dedicated packet buffer area is called a UMEM. It consists of a
      number of equally sized frames and each frame has a unique frame id. A
      descriptor in one of the queues references a frame by referencing its
      frame id. The user space allocates memory for this UMEM using whatever
      means it feels is most appropriate (malloc, mmap, huge pages,
      etc). This memory area is then registered with the kernel using the new
      setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
      and the COMPLETION queue. The fill queue is used by the application to
      send down frame ids for the kernel to fill in with RX packet
      data. References to these frames will then appear in the RX queue of
      the XSK once they have been received. The completion queue, on the
      other hand, contains frame ids that the kernel has transmitted
      completely and can now be used again by user space, for either TX or
      RX. Thus, the frame ids appearing in the completion queue are ids that
      were previously transmitted using the TX queue. In summary, the RX and
      FILL queues are used for the RX path and the TX and COMPLETION queues
      are used for the TX path.
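
      A minimal single-threaded sketch of how frame ids travel through
      these four queues is given below. It is an illustration under
      simplifying assumptions, not the kernel implementation: struct
      toy_ring and the toy_* functions are hypothetical, the ring size is
      arbitrary, and the memory barriers a real single-producer /
      single-consumer ring shared with the kernel needs are left out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RING_SIZE 8u /* power of two, so masking gives the slot index */

struct toy_ring {
	uint32_t producer;           /* advanced by the producer only */
	uint32_t consumer;           /* advanced by the consumer only */
	uint32_t ids[TOY_RING_SIZE]; /* frame ids */
};

static int toy_ring_full(const struct toy_ring *r)
{
	return r->producer - r->consumer == TOY_RING_SIZE;
}

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int toy_ring_put(struct toy_ring *r, uint32_t id)
{
	assert(r);
	if (toy_ring_full(r))
		return -1;
	r->ids[r->producer++ & (TOY_RING_SIZE - 1)] = id;
	return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
static int toy_ring_get(struct toy_ring *r, uint32_t *id)
{
	assert(r && id);
	if (r->producer == r->consumer)
		return -1;
	*id = r->ids[r->consumer++ & (TOY_RING_SIZE - 1)];
	return 0;
}

/* Kernel side of the model: move one frame id from src to dst.
 * FILL -> RX stands for a received packet,
 * TX -> COMPLETION stands for a finished transmit. */
static int toy_kernel_move(struct toy_ring *src, struct toy_ring *dst)
{
	uint32_t id;

	if (toy_ring_full(dst) || toy_ring_get(src, &id))
		return -1;
	return toy_ring_put(dst, id);
}
```

      In this model user space puts a frame id on FILL, the "kernel" moves
      it to RX, user space forwards it on TX, and the same id eventually
      reappears on COMPLETION, ready for reuse.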
      
      The socket is then finally bound with a bind() call to a device and a
      specific queue id on that device, and it is not until bind is
      completed that traffic starts to flow. Note that in this patch set,
      all packet data is copied out to user-space.
      
      A new feature in this patch set is that the UMEM can be shared between
      processes, if desired. If a process wants to do this, it simply skips
      the registration of the UMEM and its corresponding two queues, sets a
      flag in the bind call and submits the XSK of the process it would like
      to share UMEM with as well as its own newly created XSK socket. The
      new process will then receive frame id references in its own RX queue
      that point to this shared UMEM. Note that since the queue structures
      are single-consumer / single-producer (for performance reasons), the
      new process has to create its own socket with associated RX and TX
      queues, since it cannot share this with the other process. This is
      also the reason that there is only one set of FILL and COMPLETION
      queues per UMEM. It is the responsibility of a single process to
      handle the UMEM. If multiple-producer / multiple-consumer queues are
      implemented in the future, this requirement could be relaxed.
      
      How are packets then distributed between these two XSKs? We have
      introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
      full). The user-space application can place an XSK at an arbitrary
      place in this map. The XDP program can then redirect a packet to a
      specific index in this map and at this point XDP validates that the
      XSK in that map was indeed bound to that device and queue number. If
      not, the packet is dropped. If the map is empty at that index, the
      packet is also dropped. This also means that it is currently mandatory
      to have an XDP program loaded (and one XSK in the XSKMAP) to be able
      to get any traffic to user space through the XSK.
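
      The lookup-and-validate step can be pictured with the following
      user-space model. It is only a sketch of the rule stated above;
      struct toy_xsk, struct toy_xskmap and toy_redirect() are hypothetical
      and do not mirror the kernel data structures.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAP_ENTRIES 4

/* Toy AF_XDP socket: remembers what it was bound to. */
struct toy_xsk {
	int ifindex;
	unsigned int queue_id;
	unsigned int rx_packets;
};

/* Toy XSKMAP: an array of socket pointers, NULL meaning "empty". */
struct toy_xskmap {
	struct toy_xsk *slot[TOY_MAP_ENTRIES];
};

enum toy_verdict { TOY_DROP, TOY_DELIVERED };

/* Redirect a packet that arrived on (ifindex, queue_id) to map[index]. */
static enum toy_verdict toy_redirect(struct toy_xskmap *map, size_t index,
				     int ifindex, unsigned int queue_id)
{
	struct toy_xsk *xs;

	assert(map);
	if (index >= TOY_MAP_ENTRIES)
		return TOY_DROP;
	xs = map->slot[index];
	if (!xs) /* empty slot */
		return TOY_DROP;
	if (xs->ifindex != ifindex || xs->queue_id != queue_id)
		return TOY_DROP; /* socket is bound to another device/queue */
	xs->rx_packets++;
	return TOY_DELIVERED;
}
```

      Only a packet arriving on the exact device and queue the socket was
      bound to is delivered; every other case ends in a drop.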
      
      AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
      driver does not have support for XDP, or XDP_SKB is explicitly chosen
      when loading the XDP program, XDP_SKB mode is employed. It uses SKBs
      together with the generic XDP support and copies out the data to user
      space, and serves as a fallback mode that works for any network
      device. On the other hand, if the driver has support for XDP, it will
      be used by the AF_XDP code to provide better performance, but there is
      still a copy of the data into user space.
      
      There is an xdpsock benchmarking/test application included that
      demonstrates how to use AF_XDP sockets with both private and shared
      UMEMs. Say that you would like your UDP traffic from port 4242 to end
      up in queue 16, on which we will enable AF_XDP. Here, we use ethtool
      for this:
      
            ethtool -N p3p2 rx-flow-hash udp4 fn
            ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
                action 16
      
      Running the rxdrop benchmark in XDP_DRV mode can then be done
      using:
      
            samples/bpf/xdpsock -i p3p2 -q 16 -r -N
      
      For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
      can be displayed with "-h", as usual.
      
      We have run some benchmarks on a dual socket system with two Broadwell
      E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
      cores which gives a total of 28, but only two cores are used in these
      experiments: one for TX/RX and one for the user space application. The
      memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
      8192MB and with 8 of those DIMMs in the system we have 64 GB of total
      memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
      NIC is Intel I40E 40Gbit/s using the i40e driver.
      
      Below are the results in Mpps of the I40E NIC benchmark runs for 64
      and 1500 byte packets, generated by a commercial packet generator HW
      outputting packets at full 40 Gbit/s line rate. The results are without
      retpoline so that we can compare against previous numbers. With
      retpoline, the AF_XDP numbers drop by 10 to 15 percent.
      
      AF_XDP performance 64 byte packets. Results from V2 in parentheses.
      Benchmark   XDP_SKB     XDP_DRV
      rxdrop      2.9(3.0)    9.6(9.5)
      txpush      2.6(2.5)    NA*
      l2fwd       1.9(1.9)    2.5(2.5) (TX using XDP_SKB in both cases)
      
      AF_XDP performance 1500 byte packets:
      Benchmark   XDP_SKB     XDP_DRV
      rxdrop      2.1(2.2)    3.3(3.3)
      l2fwd       1.4(1.4)    1.8(1.8) (TX using XDP_SKB in both cases)
      
      * NA since we have no support for TX using the XDP_DRV infrastructure
        in this patch set. This is for a future patch set since it involves
        changes to the XDP NDOs. Some of this has been upstreamed by Jesper
        Dangaard Brouer.
      
      XDP performance on our system as a baseline:
      
      64 byte packets:
      XDP stats       CPU     pps           issue-pps
      XDP-RX CPU      16      32.3(32.9)M   0
      
      1500 byte packets:
      XDP stats       CPU     pps           issue-pps
      XDP-RX CPU      16      3.3(3.3)M     0
      
      Changes from V2:
      
      * Fixed a race in XSKMAP map found by Will. The code has been
        completely rearchitected and is now simpler, faster, and hopefully
        also not racy. Please review and check if it holds.
      
      If you would like to diff V2 against V3, you can find them here:
      https://github.com/bjoto/linux/tree/af-xdp-v2-on-bpf-next
      https://github.com/bjoto/linux/tree/af-xdp-v3-on-bpf-next
      
      The structure of the patch set is as follows:
      
      Patches 1-3: Basic socket and umem plumbing
      Patches 4-9: RX support together with the new XSKMAP
      Patches 10-13: TX support
      Patch 14: Statistics support with getsockopt()
      Patch 15: Sample application
      
      We based this patch set on bpf-next commit a3fe1f6f ("tools:
      bpftool: change time format for program 'loaded at:' information")
      
      To do for this patch set:
      
      * Syzkaller torture session being worked on
      
      Post-series plan:
      
      * Optimize performance
      
      * Kernel selftest
      
      * Loadable kernel module support for AF_XDP would be nice. Unclear how
        to achieve this though since our XDP code depends on net/core.
      
      * Support for AF_XDP sockets without an XDP program loaded. In this
        case all the traffic on a queue should go up to the user space socket.
      
      * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
        XDP_PASS" for a tcpdump-like functionality.
      
      * And of course getting to zero-copy support in small increments,
        starting with TX then adding RX.
      
      Thanks: Björn and Magnus
      ====================
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • samples/bpf: sample application and documentation for AF_XDP sockets · b4b8faa1
      Magnus Karlsson authored
      This is a sample application for AF_XDP sockets. The application
      supports three different modes of operation: rxdrop, txonly and l2fwd.
      
      To show-case a simple round-robin load-balancing between a set of
      sockets in an xskmap, set the RR_LB compile time define option to 1 in
      "xdpsock.h".
      
      v2: The entries variable was calculated twice in {umem,xq}_nb_avail.
      Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: statistics support · af75d9e0
      Magnus Karlsson authored
      In this commit, a new getsockopt is added: XDP_STATISTICS. This is
      used to obtain stats from the sockets.
      
      v2: getsockopt now returns size of stats structure.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: support for Tx · 35fcde7f
      Magnus Karlsson authored
      Here, Tx support is added. The user fills the Tx queue with frames to
      be sent by the kernel, and lets the kernel know using the sendmsg
      syscall.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • dev: packet: make packet_direct_xmit a common function · 865b03f2
      Magnus Karlsson authored
      The new dev_direct_xmit will be used by AF_XDP in later commits.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add Tx queue setup and mmap support · f6145903
      Magnus Karlsson authored
      Another setsockopt (XDP_TX_QUEUE) is added to let the process allocate
      a queue, where the user process can pass frames to be transmitted by
      the kernel.
      
      The mmapping of the queue is done using the XDP_PGOFF_TX_QUEUE offset.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add umem completion queue support and mmap · fe230832
      Magnus Karlsson authored
      Here, we add another setsockopt for registered user memory (umem)
      called XDP_UMEM_COMPLETION_QUEUE. Using this socket option, the
      process can ask the kernel to allocate a queue (ring buffer) and also
      mmap it (XDP_UMEM_PGOFF_COMPLETION_QUEUE) into the process.
      
      The queue is used to explicitly pass ownership of umem frames from the
      kernel to user process. This will be used by the TX path to tell user
      space that a certain frame has been transmitted and user space can use
      it for something else, if it wishes.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: wire up XDP_SKB side of AF_XDP · 02671e23
      Björn Töpel authored
      This commit wires up the xskmap to XDP_SKB layer.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: wire up XDP_DRV side of AF_XDP · 1b1a251c
      Björn Töpel authored
      This commit wires up the xskmap to XDP_DRV layer.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP · fbfc504a
      Björn Töpel authored
      The xskmap is yet another BPF map, very much inspired by
      dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
      adds AF_XDP sockets into the map, and by using the bpf_redirect_map
      helper, an XDP program can redirect XDP frames to an AF_XDP socket.
      
      Note that a socket that is bound to certain ifindex/queue index will
      *only* accept XDP frames from that netdev/queue index. If an XDP
      program tries to redirect from a netdev/queue index other than what
      the socket is bound to, the frame will not be received on the socket.
      
      A socket can reside in multiple maps.
      
      v3: Fixed race and simplified code.
      v2: Removed one indirection in map lookup.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add Rx receive functions and poll support · c497176c
      Björn Töpel authored
      Here the actual receive functions of AF_XDP are implemented, that in a
      later commit, will be called from the XDP layers.
      
      There's one set of functions for the XDP_DRV side and another for
      XDP_SKB (generic).
      
      A new XDP API, xdp_return_buff, is also introduced. It is analogous
      to xdp_return_frame, but acts upon a struct xdp_buff. The API will be
      used by AF_XDP in future commits.
      
      Support for the poll syscall is also implemented.
      
      v2: xskq_validate_id did not update cons_tail.
          The entries variable was calculated twice in xskq_nb_avail.
          Squashed xdp_return_buff commit.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add support for bind for Rx · 965a9909
      Magnus Karlsson authored
      Here, the bind syscall is added. Binding an AF_XDP socket means
      associating the socket with a umem, a netdev and a queue index. This
      can be done in two ways.
      
      The first way is creating a "socket from scratch". Create the umem
      using the XDP_UMEM_REG setsockopt and an associated fill queue with
      XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
      setsockopt. Call bind passing ifindex and queue index ("channel" in
      ethtool speak).
      
      The second way to bind a socket is simply skipping the
      umem/netdev/queue index, and passing another already set up AF_XDP
      socket. The new socket will then have the same umem/netdev/queue index
      as the parent so it will share the same umem. You must also set the
      flags field in the socket address to XDP_SHARED_UMEM.
      
      v2: Use PTR_ERR instead of passing error variable explicitly.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add Rx queue setup and mmap support · b9b6b68e
      Björn Töpel authored
      Another setsockopt (XDP_RX_QUEUE) is added to let the process allocate
      a queue, through which the kernel can pass completed Rx frames to the
      user process.
      
      The mmapping of the queue is done using the XDP_PGOFF_RX_QUEUE offset.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add umem fill queue support and mmap · 423f3832
      Magnus Karlsson authored
      Here, we add another setsockopt for registered user memory (umem)
      called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
      ask the kernel to allocate a queue (ring buffer) and also mmap it
      (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
      
      The queue is used to explicitly pass ownership of umem frames from the
      user process to the kernel. These frames will in a later patch be
      filled in with Rx packet data by the kernel.
      
      v2: Fixed potential crash in xsk_mmap.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: add user memory registration support sockopt · c0c77d8f
      Björn Töpel authored
      In this commit the base structure of the AF_XDP address family is set
      up. Further, we introduce the ability to register a window of user
      memory to the kernel via the XDP_UMEM_REG setsockopt syscall. The
      memory window is viewed by an AF_XDP socket as a set of equally large
      frames. After a user memory registration all frames are "owned" by the
      user application, and not the kernel.
      
      v2: More robust checks on umem creation and unaccount on error.
          Call set_page_dirty_lock on cleanup.
          Simplified xdp_umem_reg.
      Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • net: initial AF_XDP skeleton · 68e8b849
      Björn Töpel authored
      Buildable skeleton of AF_XDP without any functionality. Just what it
      takes to register a new address family.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, x86_32: add eBPF JIT compiler for ia32 · 03f5781b
      Wang YanQing authored
      The JIT compiler emits ia32 instructions. Currently, it supports eBPF
      only. Classic BPF is supported because of the conversion by the BPF core.
      
      Almost all instructions from the eBPF ISA are supported except the following:
      BPF_ALU64 | BPF_DIV | BPF_K
      BPF_ALU64 | BPF_DIV | BPF_X
      BPF_ALU64 | BPF_MOD | BPF_K
      BPF_ALU64 | BPF_MOD | BPF_X
      BPF_STX | BPF_XADD | BPF_W
      BPF_STX | BPF_XADD | BPF_DW
      
      It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL at the moment.
      
      IA32 has few general purpose registers: EAX|EDX|ECX|EBX|ESI|EDI. I use
      EAX|EDX|ECX|EBX as temporary registers to simulate instructions in eBPF
      ISA, and allocate ESI|EDI to BPF_REG_AX for constant blinding. All other
      eBPF registers, R0-R10, are simulated through scratch space on stack.
      
      The reasons behind the hardware register allocation policy are:
      1:MUL needs EAX:EDX and shift operations need ECX, so they are not fit
        for general eBPF 64bit register simulation.
      2:We need at least 4 registers to simulate most eBPF ISA operations
        on register operands instead of on register&memory operands.
      3:We need to put BPF_REG_AX in hardware registers, or constant blinding
        will degrade JIT performance heavily.
      
      Tested on PC (Intel(R) Core(TM) i5-5200U CPU).
      Testing results on i5-5200U:
      1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed]
      2) test_progs: Summary: 83 PASSED, 0 FAILED.
      3) test_lpm: OK
      4) test_lru_map: OK
      5) test_verifier: Summary: 828 PASSED, 0 FAILED.
      
      All of the above tests were run under the following two conditions separately:
      1:bpf_jit_enable=1 and bpf_jit_harden=0
      2:bpf_jit_enable=1 and bpf_jit_harden=2
      
      Below are some numbers for this JIT implementation:
      Note:
        I ran test_progs in kselftest 100 times continuously for every condition,
        the numbers are in format: total/times=avg.
        The numbers that test_bpf reports show almost the same relation.
      
      a:jit_enable=0 and jit_harden=0            b:jit_enable=1 and jit_harden=0
        test_pkt_access:PASS:ipv4:15622/100=156    test_pkt_access:PASS:ipv4:10674/100=106
        test_pkt_access:PASS:ipv6:9130/100=91      test_pkt_access:PASS:ipv6:4855/100=48
        test_xdp:PASS:ipv4:240198/100=2401         test_xdp:PASS:ipv4:138912/100=1389
        test_xdp:PASS:ipv6:137326/100=1373         test_xdp:PASS:ipv6:68542/100=685
        test_l4lb:PASS:ipv4:61100/100=611          test_l4lb:PASS:ipv4:37302/100=373
        test_l4lb:PASS:ipv6:101000/100=1010        test_l4lb:PASS:ipv6:55030/100=550
      
      c:jit_enable=1 and jit_harden=2
        test_pkt_access:PASS:ipv4:10558/100=105
        test_pkt_access:PASS:ipv6:5092/100=50
        test_xdp:PASS:ipv4:131902/100=1319
        test_xdp:PASS:ipv6:77932/100=779
        test_l4lb:PASS:ipv4:38924/100=389
        test_l4lb:PASS:ipv6:57520/100=575
      
      The numbers show we get 30%~50% improvement.
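The 30%~50% figure can be checked against the totals in the tables above. The sketch below compares condition a (interpreter) with condition b (JIT, no hardening), assuming that lower numbers are better; the commit message does not state the unit.

```python
# Totals over 100 runs, copied from the tables above: (a: interpreter, b: JIT).
totals = {
    "test_pkt_access ipv4": (15622, 10674),
    "test_pkt_access ipv6": (9130, 4855),
    "test_xdp ipv4": (240198, 138912),
    "test_xdp ipv6": (137326, 68542),
    "test_l4lb ipv4": (61100, 37302),
    "test_l4lb ipv6": (101000, 55030),
}

def improvement(before, after):
    """Fractional reduction relative to the interpreter total."""
    return (before - after) / before

improvements = {name: improvement(a, b) for name, (a, b) in totals.items()}
for name, frac in improvements.items():
    print(f"{name}: {frac:.1%}")
```

The smallest reduction is for test_pkt_access ipv4 and the largest for test_xdp ipv6, which brackets the quoted range.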
      
      See Documentation/networking/filter.txt for more information.
      
      Changelog:
      
       Changes v5-v6:
       1:Add do {} while (0) to RETPOLINE_RAX_BPF_JIT for
         consistency.
       2:Clean up non-standard comments, reported by Daniel Borkmann.
       3:Fix a memory leak issue, reported by Daniel Borkmann.
      
       Changes v4-v5:
       1:Delete is_on_stack; BPF_REG_AX is the only register
         held in real hardware registers, so just check
         against it.
       2:Apply commit 1612a981 ("bpf, x64: fix JIT emission
         for dead code"), suggested by Daniel Borkmann.
      
       Changes v3-v4:
       1:Fix changelog in commit.
         After installing llvm-6.0, test_progs no longer reports errors.
         I submitted another patch:
         "bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on x86_32 platform"
         to fix another problem; with that patch, test_verifier no longer reports errors either.
       2:Fix unnecessarily clearing r0[1] twice in *BPF_IND|BPF_ABS* simulation.
      
       Changes v2-v3:
       1:Move BPF_REG_AX to real hardware registers for performance reason.
       3:Using bpf_load_pointer instead of bpf_jit32.S, suggested by Daniel Borkmann.
       4:Delete partial codes in 1c2a088a, suggested by Daniel Borkmann.
       5:Some bug fixes and comments improvement.
      
       Changes v1-v2:
       1:Fix bug in emit_ia32_neg64.
       2:Fix bug in emit_ia32_arsh_r64.
       3:Delete filename in top level comment, suggested by Thomas Gleixner.
       4:Delete unnecessary boiler plate text, suggested by Thomas Gleixner.
       5:Rewrite some words in changelog.
       6:CodingStyle improvements and a few more comments.
      Signed-off-by: Wang YanQing <udknight@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      03f5781b
  2. 02 May, 2018 2 commits
    • Quentin Monnet's avatar
      bpf: relax constraints on formatting for eBPF helper documentation · 6f96674d
      Quentin Monnet authored
      The Python script used to parse and extract eBPF helpers documentation
      from include/uapi/linux/bpf.h expects a very specific formatting for the
      descriptions (single dot represents a space, '>' stands for a tab):
      
          /*
           ...
           *.int bpf_helper(list of arguments)
           *.>    Description
           *.>    >       Start of description
           *.>    >       Another line of description
           *.>    >       And yet another line of description
           *.>    Return
           *.>    >       0 on success, or a negative error in case of failure
           ...
           */
      
      This is too strict, and painful for developers who want to add
      documentation for new helpers. Worse, it is extremely difficult to check
      that the formatting is correct during reviews. Change the format
      expected by the script and make it more flexible. The script now works
      whether or not the initial space (right after the star) is present, and
      accepts both tabs and white spaces (or a combination of both) for
      indenting description sections and contents.
      
      Concretely, something like the following would now be supported:
      
          /*
           ...
           *int bpf_helper(list of arguments)
           *......Description
           *.>    >       Start of description...
           *>     >       Another line of description
           *..............And yet another line of description
           *>     Return
           *.>    ........0 on success, or a negative error in case of failure
           ...
           */
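As a rough illustration of the relaxed matching described above, the sketch below contrasts a strict pattern with a more tolerant one for section header lines. Both patterns are hypothetical and written for this example; they are not the regexes used by the kernel script.

```python
import re

# Hypothetical patterns, for illustration only.
# Strict: exactly one space after the star, then exactly one tab.
STRICT = re.compile(r" \* \t(Description|Return)$")
# Relaxed: optional space after the star, then any mix of tabs and spaces.
RELAXED = re.compile(r" \* ?[ \t]+(Description|Return)$")

lines = [
    " * \tDescription",      # original strict layout
    " *      Description",   # spaces only
    " *\tReturn",            # no space after the star
]

strict_hits = [bool(STRICT.match(line)) for line in lines]
relaxed_hits = [bool(RELAXED.match(line)) for line in lines]
print(strict_hits, relaxed_hits)
```

Only the first line satisfies the strict pattern, while the relaxed one accepts all three.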
      
      While at it, remove unnecessary carets from each regex used with match()
      in the script. They are redundant, as match() tries to match from the
      beginning of the string by default.
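The point about redundant carets can be seen with Python's `re` module directly. In this small sketch the sample line is made up for the example.

```python
import re

line = " * int bpf_helper(void)"

# With match(), a leading caret adds nothing: both patterns match the same span.
anchored = re.match(r"^ \* int", line)
unanchored = re.match(r" \* int", line)
same_span = anchored.span() == unanchored.span()

# match() never scans ahead, so a pattern that only occurs later fails,
# whereas search() finds it.
match_later = re.match(r"bpf_helper", line)
search_later = re.search(r"bpf_helper", line)

print(same_span, match_later, search_later is not None)
```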
      
      v2: Remove unnecessary caret when a regex is used with match().
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      6f96674d
    • Ingo Molnar's avatar
      x86/bpf: Clean up non-standard comments, to make the code more readable · a2c7a983
      Ingo Molnar authored
      So by chance I looked into x86 assembly in arch/x86/net/bpf_jit_comp.c and
      noticed the weird and inconsistent comment style it mistakenly learned from
      the networking code:
      
       /* Multi-line comment ...
        * ... looks like this.
        */
      
      Fix this to use the standard comment style specified in Documentation/CodingStyle
      and used in arch/x86/ as well:
      
       /*
        * Multi-line comment ...
        * ... looks like this.
        */
      
      Also, to quote Linus's ... more explicit views about this:
      
        http://article.gmane.org/gmane.linux.kernel.cryptoapi/21066
      
        > But no, the networking code picked *none* of the above sane formats.
        > Instead, it picked these two models that are just half-arsed
        > shit-for-brains:
        >
        >  (no)
        >      /* This is disgusting drug-induced
        >        * crap, and should die
        >        */
        >
        >   (no-no-no)
        >       /* This is also very nasty
        >        * and visually unbalanced */
        >
        > Please. The networking code actually has the *worst* possible comment
        > style. You can literally find that (no-no-no) style, which is just
        > really horribly disgusting and worse than the otherwise fairly similar
        > (d) in pretty much every way.
      
      Also improve the comments and some other details while at it:
      
       - Don't mix same-line and previous-line comment style on otherwise
         identical code patterns within the same function,
      
       - capitalize 'BPF' and x86 register names consistently,
      
       - capitalize sentences consistently,
      
       - instead of 'x64' use 'x86-64': x64 is a Microsoft specific term,
      
       - use more consistent punctuation,
      
       - use standard coding style in macros as well,
      
       - fix typos and a few other minor details.
      
      Consistent coding style is not optional, at least in arch/x86/.
      
      No change in functionality.
      
      ( In case this commit causes conflicts with pending development code
        I'll be glad to help resolve any conflicts! )
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a2c7a983
  3. 01 May, 2018 1 commit
    • Quentin Monnet's avatar
      tools: bpftool: change time format for program 'loaded at:' information · a3fe1f6f
      Quentin Monnet authored
      To make eBPF program load time easier to parse from "bpftool prog"
      output for machines, change the time format used by the program. The
      format now differs for plain and JSON version:
      
      - Plain version uses a string formatted according to ISO 8601.
      - JSON uses the number of seconds since the Epoch, which is less friendly
        for humans but even easier to process.
      
      Example output:
      
          # ./bpftool prog
          41298: xdp  tag a04f5eef06a7f555 dev foo
                  loaded_at 2018-04-18T17:19:47+0100  uid 0
                  xlated 16B  not jited  memlock 4096B
      
          # ./bpftool prog -p
          [{
                  "id": 41298,
                  "type": "xdp",
                  "tag": "a04f5eef06a7f555",
                  "gpl_compatible": false,
                  "dev": {
                      "ifindex": 14,
                      "ns_dev": 3,
                      "ns_inode": 4026531993,
                      "ifname": "foo"
                  },
                  "loaded_at": 1524068387,
                  "uid": 0,
                  "bytes_xlated": 16,
                  "jited": false,
                  "bytes_memlock": 4096
              }
          ]
      
      Previously, "Apr 18/17:19" would be used at both places.
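The two formats relate as follows. This is a sketch in Python rather than bpftool's C code; the function name `format_loaded_at` is hypothetical, and the +0100 offset is taken from the example output above.

```python
from datetime import datetime, timedelta, timezone

def format_loaded_at(epoch_seconds, tz, json_output):
    """Return the load time as epoch seconds (JSON) or an ISO 8601 string (plain)."""
    if json_output:
        return epoch_seconds
    return datetime.fromtimestamp(epoch_seconds, tz).strftime("%Y-%m-%dT%H:%M:%S%z")

# UTC+1, matching the offset shown in the plain output example.
utc_plus_one = timezone(timedelta(hours=1))

print(format_loaded_at(1524068387, utc_plus_one, json_output=False))
print(format_loaded_at(1524068387, utc_plus_one, json_output=True))
```

With that offset, the epoch value from the JSON example renders as the timestamp shown in the plain example.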
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a3fe1f6f
  4. 30 Apr, 2018 7 commits
  5. 29 Apr, 2018 7 commits
    • Teng Qin's avatar
      bpf: Allow bpf_current_task_under_cgroup in interrupt · 7ef37712
      Teng Qin authored
      Currently, the bpf_current_task_under_cgroup helper has a check where if
      the BPF program is running in_interrupt(), it will return -EINVAL. This
      prevents the helper from being used in many useful scenarios, particularly
      BPF programs attached to Perf Events.
      
      This commit removes the check. Tested in a few NMI (Perf Event) and some
      softirq contexts; the helper returns the correct result.
      Signed-off-by: Teng Qin <qinteng@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7ef37712
    • Alexei Starovoitov's avatar
      Merge branch 'fix-bpf-helpers-doc' · fcf85729
      Alexei Starovoitov authored
      Andrey Ignatov says:
      
      ====================
      BPF helpers documentation in UAPI refers to kernel ctx structures when it
      has to refer to user visible ones. Fix it.
      ====================
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fcf85729
    • Andrey Ignatov's avatar
      bpf: Sync bpf.h to tools/ · 96871b9f
      Andrey Ignatov authored
      The patch syncs bpf.h to tools/.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      96871b9f
    • Andrey Ignatov's avatar
      bpf: Fix helpers ctx struct types in uapi doc · a3ef8e9a
      Andrey Ignatov authored
      Helpers may operate on two types of ctx structures: user visible ones
      (e.g. `struct bpf_sock_ops`) when used in user programs, and kernel ones
      (e.g. `struct bpf_sock_ops_kern`) in kernel implementation.
      
      UAPI documentation must refer to only user visible structures.
      
      The patch replaces references to `_kern` structures in BPF helper
      descriptions with the corresponding user visible structures.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a3ef8e9a
    • Alexei Starovoitov's avatar
      Merge branch 'bpf_get_stack' · f60ad0a0
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      Currently, the stackmap and the bpf_get_stackid helper are provided
      for bpf programs to get the stack trace. This approach has
      a limitation though. If two stack traces have the same hash,
      only one will get stored in the stackmap table regardless of
      whether BPF_F_REUSE_STACKID is specified or not,
      so some stack traces may be missing from the user's perspective.
      
      This patch implements a new helper, bpf_get_stack, which
      sends stack traces directly to the bpf program. The bpf program
      is able to see all stack traces, and can then do in-kernel
      processing or send stack traces to user space through a
      shared map or bpf_perf_event_output.
      
      Patches #1 and #2 implemented the core kernel support.
      Patch #3 removes two never-hit branches in verifier.
      Patches #4 and #5 are two verifier improvements to make
      bpf programming easier. Patch #6 synced the new helper
      to tools headers. Patch #7 moved perf_event polling code
      and ksym lookup code from samples/bpf to
      tools/testing/selftests/bpf. Patch #8 added a verifier
      test in tools/bpf for new verifier change.
      Patches #9 and #10 added tests for raw tracepoint prog
      and tracepoint prog respectively.
      
      Changelogs:
        v8 -> v9:
          . make function perf_event_mmap (in trace_helpers.c) extern
            to decouple perf_event_mmap and perf_event_poller.
          . add jit enabled handling for kernel stack verification
            in Patch #9. Since we did not have a good way to
            verify jit enabled kernel stack, just return true if
            the kernel stack is not empty.
          . In patch #9, use raw_syscalls/sys_enter instead of
            sched/sched_switch, and remove the call to cmd
            "task 1 dd if=/dev/zero of=/dev/null", which left
            a dangling process after the program exited.
        v7 -> v8:
          . rebase on top of latest bpf-next
          . simplify BPF_ARSH dst_reg->smin_val/smax_value tracking
          . rewrite the description of bpf_get_stack() in uapi bpf.h
            based on new format.
        v6 -> v7:
          . do perf callchain buffer allocation inside the
            verifier. so if the prog->has_callchain_buf is set,
            it is guaranteed that the buffer has been allocated.
          . change condition "trace_nr <= skip" to "trace_nr < skip"
            so that for zero size buffer, return 0 instead of -EFAULT
        v5 -> v6:
          . after refining return register smax_value and umax_value
            for helpers bpf_get_stack and bpf_probe_read_str,
            bounds and var_off of the return register are further refined.
          . added missing commit message for tools header sync commit.
          . removed one unnecessary empty line.
        v4 -> v5:
          . relied on dst_reg->var_off to refine umin_val/umax_val
            in verifier handling BPF_ARSH value range tracking,
            suggested by Edward.
        v3 -> v4:
          . fixed a bug when meta ptr is set to NULL in check_func_arg.
          . introduced tnum_arshift and added detailed comments for
            the underlying implementation
          . avoided using VLA in tools/bpf test_progs.
        v2 -> v3:
          . used meta to track helper memory size argument
          . implemented range checking for ARSH in verifier
          . moved perf event polling and ksym related functions
            from samples/bpf to tools/bpf
          . added test to compare build id's between bpf_get_stackid
            and bpf_get_stack
        v1 -> v2:
          . fixed compilation error when CONFIG_PERF_EVENTS is not enabled
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f60ad0a0
    • Yonghong Song's avatar
      tools/bpf: add a test for bpf_get_stack with tracepoint prog · 79b45350
      Yonghong Song authored
      The test_stacktrace_map and test_stacktrace_build_id tests are
      enhanced to call bpf_get_stack in the helper to get the
      stack trace as well.  The stack traces from bpf_get_stack
      and bpf_get_stackid are compared to ensure that, for the
      same stack as represented by the same hash, their ip addresses
      or build id's are the same.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      79b45350
    • Yonghong Song's avatar
      tools/bpf: add a test for bpf_get_stack with raw tracepoint prog · 173965fb
      Yonghong Song authored
      The test attaches a raw_tracepoint program to raw_syscalls/sys_enter.
      It tests getting the stack for user space, kernel space, and user
      space with a build_id request. It also tests getting the user
      and kernel stacks into the same buffer with back-to-back
      bpf_get_stack helper calls.
      
      If jit is not enabled, the user space application will check
      to ensure that the kernel function for raw_tracepoint
      ___bpf_prog_run is part of the stack.
      
      If jit is enabled, we did not have a reliable way to
      verify the kernel stack, so just assume the kernel stack
      is good when the kernel stack size is greater than 0.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      173965fb