1. 02 Mar, 2019 6 commits
      bpf: User program for testing HBM · a1270fe9
      brakmo authored
      The program hbm creates a cgroup and attaches a BPF program to the
      cgroup for testing HBM (Host Bandwidth Manager) for egress traffic.
      One still needs to create network traffic. This can be done through
      netesto, netperf or iperf3.
      A follow-up patch contains a script to create traffic.
      
      USAGE: hbm [-d] [-l] [-n <id>] [-r <rate>] [-s] [-t <secs>]
                 [-w] [-h] [prog]
        Where:
         -d        Print BPF trace debug buffer
         -l        Also limit flows doing loopback
         -n <id>   Create cgroup "/hbm<id>" and attach prog. Default is /hbm1
                   This is convenient when testing HBM in more than 1 cgroup
         -r <rate> Rate limit in Mbps
         -s        Get HBM stats (marked, dropped, etc.)
         -t <time> Exit after specified seconds (default is 0)
         -w        Work conserving flag. cgroup can increase its bandwidth
                   beyond the rate limit specified while there is available
                   bandwidth. Current implementation assumes there is only
                   one NIC (eth0), but can be extended to support multiple NICs.
                   Currently only supported for egress. Note, this is just
                   a proof of concept.
         -h        Print this info
         prog      BPF program file name. Name defaults to hbm_out_kern.o
      
      More information about HBM can be found in the paper "BPF Host Resource
      Management" presented at the 2018 Linux Plumbers Conference, Networking Track
      (http://vger.kernel.org/lpc_net2018_talks/LPC%20BPF%20Network%20Resource%20Paper.pdf)
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bpf: Sample HBM BPF program to limit egress bw · 187d0738
      brakmo authored
      A cgroup skb BPF program to limit cgroup output bandwidth.
      It uses a modified virtual token bucket queue to limit average
      egress bandwidth. The implementation uses credits instead of tokens.
      Negative credits imply that queueing would have happened. This is
      a virtual queue, so no queueing is done by it; however, queueing may
      occur at the actual qdisc (which is not used for rate limiting).
      
      This implementation uses 3 thresholds, one to start marking packets and
      the other two to drop packets:
                                       CREDIT
             - <--------------------------|------------------------> +
                   |    |          |      0
                   |  Large pkt    |
                   |  drop thresh  |
        Small pkt drop             Mark threshold
            thresh
      
      The effect of marking depends on the type of packet:
      a) If the packet is ECN enabled, then the packet is ECN CE marked.
         The current mark threshold is tuned for DCTCP.
      b) Else, it is dropped if it is a large packet.
      
      If the credit is below the drop threshold, the packet is dropped.
      Note that dropping a packet through the BPF program does not trigger CWR
      (Congestion Window Reduction) in TCP packets. A future patch will add
      support for triggering CWR.
      
      This BPF program actually uses 2 drop thresholds, one threshold
      for larger packets (>= 120 bytes) and another for smaller packets. This
      protects smaller packets such as SYNs, ACKs, etc.
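
      The credit and threshold logic described above can be sketched in
      plain user-space C. This is only an illustration under stated
      assumptions, not the code in hbm_out_kern.c: the names (vqueue,
      vq_charge), the threshold values, the credit cap and the choice not
      to charge dropped packets are all hypothetical. Only the 120 byte
      large-packet boundary and the mark/drop rules come from the text.

```c
#include <stdint.h>

/* Illustrative values only (assumptions), except the 120 byte boundary. */
#define LARGE_PKT_BYTES    120        /* packets >= this are "large" */
#define MAX_CREDIT         10000      /* cap on saved-up credit, in bytes */
#define MARK_THRESH        (-20000)   /* below this: mark (or drop) */
#define LARGE_DROP_THRESH  (-40000)   /* below this: drop large packets */
#define SMALL_DROP_THRESH  (-80000)   /* below this: drop small packets too */

enum vq_verdict { VQ_PASS, VQ_MARK, VQ_DROP };

struct vqueue {
    int64_t  credit;              /* bytes; negative means virtual backlog */
    uint64_t last_ns;             /* timestamp of the previous update */
    uint64_t rate_bytes_per_sec;  /* configured bandwidth limit */
};

/* Earn credit for elapsed time, judge the packet, then charge it. */
static enum vq_verdict vq_charge(struct vqueue *q, uint64_t now_ns,
                                 uint32_t pkt_len, int ecn_capable)
{
    uint64_t delta_ns = now_ns - q->last_ns;
    uint64_t secs = delta_ns / 1000000000ULL;
    uint64_t rem_ns = delta_ns % 1000000000ULL;
    /* Split into whole seconds and remainder to keep products small. */
    int64_t earned = (int64_t)(secs * q->rate_bytes_per_sec +
                               rem_ns * q->rate_bytes_per_sec / 1000000000ULL);
    int large = pkt_len >= LARGE_PKT_BYTES;
    int64_t drop_thresh = large ? LARGE_DROP_THRESH : SMALL_DROP_THRESH;
    enum vq_verdict verdict = VQ_PASS;

    q->last_ns = now_ns;
    q->credit += earned;
    if (q->credit > MAX_CREDIT)
        q->credit = MAX_CREDIT;

    /* Below the size-specific drop threshold: drop without charging. */
    if (q->credit < drop_thresh)
        return VQ_DROP;

    /* Below the mark threshold: CE mark if possible, else drop if large. */
    if (q->credit < MARK_THRESH) {
        if (ecn_capable)
            verdict = VQ_MARK;
        else if (large)
            return VQ_DROP;
    }

    q->credit -= pkt_len;   /* the packet consumes credit */
    return verdict;
}
```

      A caller would invoke vq_charge() once per egress packet with a
      monotonic timestamp. Small non-ECN packets such as SYNs and ACKs
      pass until the credit falls below the lower small-packet threshold.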
      
      The default bandwidth limit is set at 1Gbps but this can be changed by
      a user program through a shared BPF map. In addition, by default this BPF
      program does not limit connections using loopback. This behavior can be
      overridden by the user program. There is also an option to calculate
      some statistics, such as percent of packets marked or dropped, which
      the user program can access.
      
      A later patch provides such a program (hbm.c).
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bpf: sync bpf.h to tools and update bpf_helpers.h · 5cce85c6
      brakmo authored
      This patch syncs the uapi bpf.h to tools/ and also updates
      bpf_helpers.h in tools/
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bpf: add bpf helper bpf_skb_ecn_set_ce · f7c917ba
      brakmo authored
      This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce
      "int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to
      BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can
      be attached to the ingress and egress path. The helper is needed
      because this type of bpf_prog cannot modify the skb directly.
      
      This helper is used to set the ECN field of ECN capable IP packets to CE
      (congestion encountered) in the IPv6 or IPv4 header of the skb. It can be
      used by a bpf_prog to manage egress or ingress network bandwidth limit
      per cgroupv2 by inducing an ECN response in the TCP sender.
      This works best when using DCTCP.
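
      To make "setting the ECN field to CE" concrete, here is a minimal
      user-space sketch operating on a single TOS / traffic class byte.
      It is not the kernel helper: the function name ecn_set_ce and its
      return convention are assumptions for illustration, and a real
      implementation on IPv4 also has to fix up the header checksum. The
      two-bit ECN codepoints follow RFC 3168.

```c
#include <stdint.h>

/* ECN codepoints: low two bits of the IPv4 TOS / IPv6 traffic class byte. */
#define ECN_MASK     0x03
#define ECN_NOT_ECT  0x00   /* sender is not ECN capable */
#define ECN_ECT_1    0x01   /* ECN capable transport, codepoint 1 */
#define ECN_ECT_0    0x02   /* ECN capable transport, codepoint 0 */
#define ECN_CE       0x03   /* congestion encountered */

/*
 * Hypothetical helper: mark the byte as CE if the packet is ECN capable.
 * Returns 1 if the byte carries CE afterwards, 0 if the packet is not
 * ECN capable and was left untouched. The DSCP bits are preserved.
 */
static int ecn_set_ce(uint8_t *tos)
{
    if ((*tos & ECN_MASK) == ECN_NOT_ECT)
        return 0;
    *tos |= ECN_CE;
    return 1;
}
```

      A caller that gets 0 back cannot signal congestion through ECN for
      that packet and has to fall back to another action, such as
      dropping it.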
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      samples/bpf: silence compiler warning for xdpsock_user.c · b74e21ab
      Yonghong Song authored
      Compiling xdpsock_user.c with gcc 4.8.5, I hit the following
      compilation warning:
          HOSTCC  samples/bpf/xdpsock_user.o
        /data/users/yhs/work/net-next/samples/bpf/xdpsock_user.c: In function ‘main’:
        /data/users/yhs/work/net-next/samples/bpf/xdpsock_user.c:449:6: warning: ‘idx_cq’ may be used uninitialized in this function [-Wmaybe-uninitialized]
          u32 idx_cq, idx_fq;
              ^
        /data/users/yhs/work/net-next/samples/bpf/xdpsock_user.c:606:7: warning: ‘idx_rx’ may be used uninitialized in this function [-Wmaybe-uninitialized]
           u32 idx_rx, idx_tx = 0;
               ^
        /data/users/yhs/work/net-next/samples/bpf/xdpsock_user.c:506:6: warning: ‘idx_rx’ may be used uninitialized in this function [-Wmaybe-uninitialized]
          u32 idx_rx, idx_fq = 0;
      
      As an example, the code pattern looks like:
          u32 idx_fq;
          ...
          ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
          if (ret) {
            ...
          }
          ... idx_fq ...
      The compiler warns since it does not know whether &idx_fq is assigned
      or not inside the library function xsk_ring_prod__reserve().
      
      Let us assign an initial value of 0 to such automatic variables to
      silence the compiler warning.
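
      A self-contained sketch of the pattern and the fix is given here.
      The names demo_reserve and demo_first_slot are hypothetical
      stand-ins, not libbpf functions; the point is only that the index
      is written through a pointer on success, so the caller initializes
      it to 0 to keep older compilers quiet.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a library reserve call. On success it stores
 * the first reserved slot in *idx and returns nb; on failure it returns 0
 * and leaves *idx untouched.
 */
static size_t demo_reserve(uint32_t *next_free, uint32_t capacity,
                           size_t nb, uint32_t *idx)
{
    if (*next_free + nb > capacity)
        return 0;               /* *idx is deliberately not written */
    *idx = *next_free;
    *next_free += (uint32_t)nb;
    return nb;
}

/* Returns the first reserved slot, or UINT32_MAX if nothing was reserved. */
static uint32_t demo_first_slot(uint32_t *next_free, uint32_t capacity,
                                size_t nb)
{
    uint32_t idx = 0;           /* initialized to silence the warning */

    if (demo_reserve(next_free, capacity, nb, &idx) != nb)
        return UINT32_MAX;
    return idx;                 /* only read after a successful reserve */
}
```

      The initial value is never observed on the success path, so the
      change only affects what the compiler can prove, not the behavior.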
      
      Fixes: 248c7f9c ("samples/bpf: convert xdpsock to use libbpf for AF_XDP access")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      selftests/bpf: set unlimited RLIMIT_MEMLOCK for test_sock_fields · a83de906
      Yonghong Song authored
      This is to avoid a permission denied error. A lot of systems
      may have a much lower number, e.g., 64KB, for RLIMIT_MEMLOCK,
      which may not be sufficient for the test to run successfully.
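
      Raising the limit from a test program can be sketched with the
      standard setrlimit() call. The helper names below are hypothetical
      and this is not necessarily how test_sock_fields does it; note that
      raising the hard limit normally requires privileges, so the call
      can fail with EPERM.

```c
#include <sys/resource.h>

/* Build an rlimit that removes both the soft and the hard limit. */
static struct rlimit memlock_unlimited(void)
{
    struct rlimit r = {
        .rlim_cur = RLIM_INFINITY,
        .rlim_max = RLIM_INFINITY,
    };
    return r;
}

/*
 * Apply the unlimited RLIMIT_MEMLOCK to the calling process.
 * Returns 0 on success, or -1 with errno set (for example EPERM when
 * the process is not allowed to raise its hard limit).
 */
static int raise_memlock_limit(void)
{
    struct rlimit r = memlock_unlimited();

    return setrlimit(RLIMIT_MEMLOCK, &r);
}
```

      Calling raise_memlock_limit() before loading BPF programs or
      creating maps avoids failures caused by a low default such as 64KB.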
      
      Fixes: e0b27b3f ("bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. 01 Mar, 2019 11 commits
  3. 28 Feb, 2019 13 commits
  4. 27 Feb, 2019 6 commits
  5. 25 Feb, 2019 4 commits
      Merge branch 'bpf-libbpf-af-xdp' · 143bdc2e
      Daniel Borkmann authored
      Magnus Karlsson says:
      
      ====================
      This patch proposes to add AF_XDP support to libbpf. The main reason
      for this is to facilitate writing applications that use AF_XDP by
      offering higher-level APIs that hide many of the details of the AF_XDP
      uapi. This is in the same vein as libbpf facilitates XDP adoption by
      offering easy-to-use higher level interfaces of XDP
      functionality. Hopefully this will facilitate adoption of AF_XDP, make
      applications using it simpler and smaller, and finally also make it
      possible for applications to benefit from optimizations in the AF_XDP
      user space access code. Previously, people just copied and pasted the
      code from the sample application into their application, which is not
      desirable.
      
      The proposed interface is composed of two parts:
      
      * Low-level access interface to the four rings and the packet
      * High-level control plane interface for creating and setting up umems
        and AF_XDP sockets. This interface also loads a simple XDP program
        that routes all traffic on a queue up to the AF_XDP socket.
      
      The sample program has been updated to use this new interface and in
      that process it lost roughly 300 lines of code. I cannot detect any
      performance degradations due to the use of this library instead of the
      previous functions that were inlined in the sample application. But I
      did measure this on a slower machine and not the Broadwell that we
      normally use.
      
      The rings are now called xsk_ring. When a producer operates on a
      ring, it is an xsk_ring_prod; for a consumer it is an xsk_ring_cons.
      This way we can get some compile time error checking that the rings
      are used correctly.
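
      The idea of getting compile time checking from distinct producer
      and consumer types can be sketched as follows. This is a toy,
      single-threaded ring with hypothetical demo_ring_* names, not the
      libbpf implementation: the real rings are shared with the kernel
      and need memory barriers, which are omitted here.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8u   /* must be a power of two */

struct demo_ring {
    uint32_t producer;                 /* free-running producer index */
    uint32_t consumer;                 /* free-running consumer index */
    uint64_t slots[DEMO_RING_SIZE];
};

/* Distinct wrapper types: passing the wrong one is a compile error. */
struct demo_ring_prod { struct demo_ring *ring; };
struct demo_ring_cons { struct demo_ring *ring; };

/* Reserve nb slots; returns nb and sets *idx, or returns 0 if full. */
static size_t demo_ring_prod__reserve(struct demo_ring_prod *p, size_t nb,
                                      uint32_t *idx)
{
    uint32_t used = p->ring->producer - p->ring->consumer;

    if (nb > DEMO_RING_SIZE - used)
        return 0;
    *idx = p->ring->producer;
    return nb;
}

static void demo_ring_prod__write(struct demo_ring_prod *p, uint32_t idx,
                                  uint64_t value)
{
    p->ring->slots[idx & (DEMO_RING_SIZE - 1)] = value;
}

/* Publish nb previously reserved and written slots. */
static void demo_ring_prod__submit(struct demo_ring_prod *p, size_t nb)
{
    p->ring->producer += (uint32_t)nb;
}

/* Look at up to nb filled slots; returns how many are available. */
static size_t demo_ring_cons__peek(struct demo_ring_cons *c, size_t nb,
                                   uint32_t *idx)
{
    uint32_t avail = c->ring->producer - c->ring->consumer;

    if (nb > avail)
        nb = avail;
    if (nb > 0)
        *idx = c->ring->consumer;
    return nb;
}

static uint64_t demo_ring_cons__read(struct demo_ring_cons *c, uint32_t idx)
{
    return c->ring->slots[idx & (DEMO_RING_SIZE - 1)];
}

/* Hand nb consumed slots back to the producer side. */
static void demo_ring_cons__release(struct demo_ring_cons *c, size_t nb)
{
    c->ring->consumer += (uint32_t)nb;
}
```

      With this layout, calling demo_ring_prod__reserve() on a struct
      demo_ring_cons pointer is rejected by the compiler as an
      incompatible pointer type, which is the kind of check the
      paragraph above refers to.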
      
      Comments and contemplations:
      
      * The current behaviour is that the library loads an XDP program (if
        requested to do so) but the clean up of this program is left to the
        application. It would be possible to implement this cleanup in the
        library, but it would require state to be kept on netdev level,
        which there is none at the moment, and the synchronization of this
        between processes. All of this adds complexity. But when we get an
        XDP program per queue id, then it becomes trivial to also remove the
        XDP program when the application exits. This proposal from Jesper,
        Björn and others will also improve the performance of libbpf, since
        most of the XDP program code can be removed when that feature is
        supported.
      
      * In a future release, I am planning on adding a higher level data
        plane interface too. This will be based around recvmsg and sendmsg
        with the use of struct iovec for batching, without the user having
        to know anything about the underlying four rings of an AF_XDP
        socket. There will be one semantic difference though from the
        standard recvmsg and that is that the kernel will fill in the iovecs
        instead of the application. But the rest should be the same as the
        libc versions so that application writers feel at home.
      
      Patch 1: adds AF_XDP support in libbpf
      Patch 2: updates the xdpsock sample application to use the libbpf functions
      Patch 3: Documentation update to help first time users
      
      Changes v5 to v6:
        * Fixed prog_fd bug found by Xiaolong Ye. Thanks!
      Changes v4 to v5:
        * Added a FAQ to the documentation
        * Removed xsk_umem__get_data and renamed xsk_umem__get_data_raw to
          xsk_umem__get_data
        * Replaced the netlink code with bpf_get_link_xdp_id()
        * Dynamic allocation of the map sizes. They are now sized after
          the max number of queues on the netdev in question.
      Changes v3 to v4:
        * Dropped the pr_*() patch in favor of Yonghong Song's patch set
        * Addressed the review comments of Daniel Borkmann, mainly leaking
          of file descriptors at clean up and making the data plane APIs
          all static inline (with the exception of xsk_umem__get_data that
          uses an internal structure I do not want to expose).
        * Fixed the netlink callback as suggested by Maciej Fijalkowski.
        * Removed an unnecessary include in the sample program as spotted by
          Ilia Fillipov.
      Changes v2 to v3:
        * Added automatic loading of a simple XDP program that routes all
          traffic on a queue up to the AF_XDP socket. This program loading
          can be disabled.
        * Updated function names to be consistent with the libbpf naming
          convention
        * Moved all code to xsk.[ch]
        * Removed all the XDP program loading code from the sample since
          this is now done by libbpf
        * The initialization functions now return a handle as suggested by
          Alexei
        * const statements added in the API where applicable.
      Changes v1 to v2:
        * Fixed cleanup of library state on error.
        * Moved API to initial version
        * Prefixed all public functions by xsk__ instead of xsk_
        * Added comment about changed default ring sizes, batch size and umem
          size in the sample application commit message
        * The library now only creates an Rx or Tx ring if the respective
          parameter is != NULL
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      xsk: add FAQ to facilitate for first time users · 0f4a9b7d
      Magnus Karlsson authored
      Added an FAQ section in Documentation/networking/af_xdp.rst to help
      first time users with common problems. As problems are getting
      identified, entries will be added to the FAQ.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      samples/bpf: convert xdpsock to use libbpf for AF_XDP access · 248c7f9c
      Magnus Karlsson authored
      This commit converts the xdpsock sample application to use the AF_XDP
      functions present in libbpf. This cuts down the size of it by nearly
      300 lines of code.
      
      The default ring sizes and the batch size have been increased and the
      size of the umem area has been decreased, so that the sample application
      provides higher throughput. Note also that the shared umem code
      has been removed from the sample as this is not supported by libbpf
      at this point in time.
      Tested-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      libbpf: add support for using AF_XDP sockets · 1cad0788
      Magnus Karlsson authored
      This commit adds AF_XDP support to libbpf. The main reason for this is
      to facilitate writing applications that use AF_XDP by offering
      higher-level APIs that hide many of the details of the AF_XDP
      uapi. This is in the same vein as libbpf facilitates XDP adoption by
      offering easy-to-use higher level interfaces of XDP
      functionality. Hopefully this will facilitate adoption of AF_XDP, make
      applications using it simpler and smaller, and finally also make it
      possible for applications to benefit from optimizations in the AF_XDP
      user space access code. Previously, people just copied and pasted the
      code from the sample application into their application, which is not
      desirable.
      
      The interface is composed of two parts:
      
      * Low-level access interface to the four rings and the packet
      * High-level control plane interface for creating and setting
        up umems and AF_XDP sockets as well as a simple XDP program.
      Tested-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>