  1. 17 Jun, 2019 7 commits
    • libbpf: allow specifying map definitions using BTF · abd29c93
      Andrii Nakryiko authored
      This patch adds support for a new way to define BPF maps. It relies on
      BTF to describe mandatory and optional attributes of a map, as well as
      captures type information of key and value naturally. This eliminates
      the need for BPF_ANNOTATE_KV_PAIR hack and ensures key/value sizes are
      always in sync with the key/value type.
      
      Relying on BTF, this approach allows for both forward and backward
      compatibility w.r.t. extending supported map definition features. By
      default, any unrecognized attributes are treated as an error, but it's
      possible to relax this using the MAPS_RELAX_COMPAT flag. New attributes added
      in the future will need to be optional.
      
      The outline of the new map definition (short, BTF-defined maps) is as follows:
      1. All the maps should be defined in .maps ELF section. It's possible to
         have both "legacy" map definitions in `maps` sections and BTF-defined
         maps in .maps sections. Everything will still work transparently.
      2. The map declaration and initialization is done through
         a global/static variable of a struct type with a few mandatory and
         extra optional fields:
         - type field is mandatory and specifies the type of BPF map;
         - key/value fields are mandatory and capture key/value type/size information;
         - max_entries attribute is optional; if max_entries is not specified or
           initialized, it has to be provided in runtime through libbpf API
           before loading bpf_object;
         - map_flags is optional and if not defined, will be assumed to be 0.
      3. Key/value fields should be **a pointer** to a type describing
         key/value. The pointee type is assumed (and will be recorded as such
         and used for size determination) to be a type describing key/value of
         the map. This is done to save excessive amounts of space allocated in
         corresponding ELF sections for key/value of big size.
      4. As some maps disallow having BTF type ID associated with key/value,
         it's possible to specify key/value size explicitly without
         associating BTF type ID with it. Use key_size and value_size fields
         to do that (see example below).
      
      Here's an example of a simple ARRAY map definition:
      
      struct my_value { int x, y, z; };
      
      struct {
      	int type;
      	int max_entries;
      	int *key;
      	struct my_value *value;
      } btf_map SEC(".maps") = {
      	.type = BPF_MAP_TYPE_ARRAY,
      	.max_entries = 16,
      };
      
      This will define BPF ARRAY map 'btf_map' with 16 elements. The key will
      be of type int and thus key size will be 4 bytes. The value is struct
      my_value of size 12 bytes. This map can be used from C code exactly the
      same as with existing maps defined through struct bpf_map_def.
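
      The pointer convention from point 3 can be illustrated in plain C. This
      is only a sketch: libbpf derives the sizes from BTF, not from sizeof,
      and the struct name and helper functions below are hypothetical. It
      shows that a pointer field is enough to recover the pointee's size
      while the map variable itself stays small.

```c
#include <stddef.h>

/* Value type from the ARRAY example in the commit message. */
struct my_value { int x, y, z; };

/* Same field layout as the BTF-defined map example. The key/value fields
 * are pointers, so the variable's size does not grow with the value type. */
struct btf_map_shape {
	int type;
	int max_entries;
	int *key;
	struct my_value *value;
};

/* Hypothetical helpers: sizeof does not evaluate its operand, so the
 * pointers are never dereferenced and may be NULL. */
static size_t key_size_of(const struct btf_map_shape *m)
{
	return sizeof(*m->key);
}

static size_t value_size_of(const struct btf_map_shape *m)
{
	return sizeof(*m->value);
}
```

      With a zero-initialized struct btf_map_shape, key_size_of() yields
      sizeof(int) and value_size_of() yields sizeof(struct my_value), which
      matches the 4-byte key and 12-byte value described above on common
      platforms.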
      
      Here's an example of STACKMAP definition (which currently disallows BTF type
      IDs for key/value):
      
      struct {
      	__u32 type;
      	__u32 max_entries;
      	__u32 map_flags;
      	__u32 key_size;
      	__u32 value_size;
      } stackmap SEC(".maps") = {
      	.type = BPF_MAP_TYPE_STACK_TRACE,
      	.max_entries = 128,
      	.map_flags = BPF_F_STACK_BUILD_ID,
      	.key_size = sizeof(__u32),
      	.value_size = PERF_MAX_STACK_DEPTH * sizeof(struct bpf_stack_build_id),
      };
      
      This approach extends naturally to support map-in-map, by making the value
      field another struct that describes the inner map. This feature is not
      implemented yet. It's also possible to incrementally add features like pinning
      with full backwards and forward compatibility. Support for static
      initialization of BPF_MAP_TYPE_PROG_ARRAY using pointers to BPF programs
      is also on the roadmap.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: split initialization and loading of BTF · 063183bf
      Andrii Nakryiko authored
      Libbpf does sanitization of BTF before loading it into kernel, if kernel
      doesn't support some of newer BTF features. This removes some of the
      important information from BTF (e.g., DATASEC and VAR description),
      which will be used for map construction. This patch splits BTF
      processing into initialization step, in which BTF is initialized from
      ELF and all the original data is still preserved; and
      sanitization/loading step, which ensures that BTF is safe to load into
      kernel. This allows using full BTF information to construct maps, while
      still loading valid BTF into older kernels.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: identify maps by section index in addition to offset · db48814b
      Andrii Nakryiko authored
      To support maps defined in multiple sections, it's important to
      identify a map not just by its offset within its section, but by section
      index as well. This patch adds tracking of section index.
      
      For global data, we record section index of corresponding
      .data/.bss/.rodata ELF section for uniformity, and thus don't need
      a special value of offset for those maps.
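
      A minimal sketch of why the offset alone is ambiguous once maps live in
      several sections. The record type and lookup function are hypothetical
      and are not libbpf's actual data structures.

```c
#include <stddef.h>

/* Hypothetical, simplified map record. */
struct map_rec {
	const char *name;
	int sec_idx;        /* index of the ELF section holding the definition */
	size_t sec_offset;  /* offset of the definition within that section */
};

/* Look a map up by the (section index, offset) pair. Two maps may share
 * an offset as long as they sit in different sections. */
static const struct map_rec *find_map(const struct map_rec *maps, size_t n,
				      int sec_idx, size_t sec_offset)
{
	for (size_t i = 0; i < n; i++) {
		if (maps[i].sec_idx == sec_idx &&
		    maps[i].sec_offset == sec_offset)
			return &maps[i];
	}
	return NULL;
}
```

      Two records that both start at offset 0, one in each of two sections,
      are told apart only by sec_idx; a lookup keyed on the offset alone
      could return either one.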
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: refactor map initialization · bf829271
      Andrii Nakryiko authored
      User and global data maps initialization has gotten pretty complicated
      and unnecessarily convoluted. This patch splits out the logic for global
      data map and user-defined map initialization. It also removes the
      restriction of pre-calculating how many maps will be initialized,
      instead allowing to keep adding new maps as they are discovered, which
      will be used later for BTF-defined map definitions.
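
      Adding maps as they are discovered, instead of counting them up front,
      can be sketched with a growable array. The type and function names
      below are hypothetical, and the doubling policy is an assumption rather
      than a description of what the patch does.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified map slot. */
struct map_slot {
	int sec_idx;
	size_t sec_offset;
};

struct map_list {
	struct map_slot *items;
	size_t count;   /* slots in use */
	size_t cap;     /* slots allocated */
};

/* Return a zeroed slot for a newly discovered map, growing the array on
 * demand. Returns NULL on allocation failure and leaves the list intact. */
static struct map_slot *map_list_add(struct map_list *l)
{
	struct map_slot *slot;

	if (l->count == l->cap) {
		size_t new_cap = l->cap ? l->cap * 2 : 4;
		struct map_slot *tmp;

		tmp = realloc(l->items, new_cap * sizeof(*tmp));
		if (!tmp)
			return NULL;
		l->items = tmp;
		l->cap = new_cap;
	}

	slot = &l->items[l->count++];
	memset(slot, 0, sizeof(*slot));
	return slot;
}
```

      Callers start from a zero-initialized struct map_list and free
      l.items when done. Pointers returned earlier may be invalidated by a
      later call, because realloc can move the array.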
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: streamline ELF parsing error-handling · 01b29d1d
      Andrii Nakryiko authored
      Simplify ELF parsing logic by exiting early, as there is no common clean
      up path to execute. That makes it unnecessary to track when err was set
      and when it was cleared. It also reduces nesting in some places.
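
      The early-exit style described here can be sketched as follows. The
      function is hypothetical and far simpler than libbpf's ELF parsing; it
      only shows that, with no shared cleanup path, each failed check can
      return directly and no err variable needs tracking.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical header check: returns 0 and stores the ELF class byte
 * (1 = 32-bit, 2 = 64-bit) on success, or a negative errno on failure. */
static int check_elf_header(const unsigned char *buf, size_t len,
			    int *elf_class)
{
	if (!buf || !elf_class)
		return -EINVAL;

	/* Need the 4 magic bytes plus the class byte at index 4. */
	if (len < 5)
		return -EINVAL;

	if (memcmp(buf, "\177ELF", 4) != 0)
		return -EINVAL;

	if (buf[4] != 1 && buf[4] != 2)
		return -EINVAL;

	*elf_class = buf[4];
	return 0;
}
```

      Each check sits at the same nesting level, and the success path reads
      straight down to the final return.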
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: extract BTF loading logic · 9c6660d0
      Andrii Nakryiko authored
      As a preparation for adding BTF-based BPF map loading, extract .BTF and
      .BTF.ext loading logic.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: add common min/max macro to libbpf_internal.h · d7fe74f9
      Andrii Nakryiko authored
      Multiple files in libbpf redefine their own definitions for min/max.
      Let's define them in libbpf_internal.h and use those everywhere.
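
      A shared definition of this kind might look like the sketch below. The
      exact form added to libbpf_internal.h is not shown in the commit
      message, so treat the guards and macro bodies as an assumption.

```c
/* Guarded so that a definition from another header is not redefined. */
#ifndef min
# define min(x, y) ((x) < (y) ? (x) : (y))
#endif
#ifndef max
# define max(x, y) ((x) < (y) ? (y) : (x))
#endif

/* Example use: clamp a requested length to a buffer's capacity. */
static unsigned int clamp_len(unsigned int requested, unsigned int capacity)
{
	return min(requested, capacity);
}
```

      Simple macros like these evaluate each argument twice, so arguments
      with side effects (for example i++) should be avoided.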
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. 14 Jun, 2019 11 commits
  3. 12 Jun, 2019 1 commit
    • bpf: silence warning messages in core · aee450cb
      Valdis Klētnieks authored
      Compiling kernel/bpf/core.c with W=1 causes a flood of warnings:
      
      kernel/bpf/core.c:1198:65: warning: initialized field overwritten [-Woverride-init]
       1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
            |                                                                 ^~~~
      kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL'
       1087 |  INSN_3(ALU, ADD,  X),   \
            |  ^~~~~~
      kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP'
       1202 |   BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
            |   ^~~~~~~~~~~~
      kernel/bpf/core.c:1198:65: note: (near initialization for 'public_insntable[12]')
       1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
            |                                                                 ^~~~
      kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL'
       1087 |  INSN_3(ALU, ADD,  X),   \
            |  ^~~~~~
      kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP'
       1202 |   BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
            |   ^~~~~~~~~~~~
      
      98 copies of the above.
      
      The attached patch silences the warnings, because we *know* we're overwriting
      the default initializer. That leaves bpf/core.c with only 6 other warnings,
      which become more visible in comparison.
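
      The pattern behind the warning can be reproduced in a few lines. This
      is a sketch with a hypothetical table name. The pragma is just one way
      to silence the diagnostic locally; the commit message does not say
      which mechanism the patch uses.

```c
#include <stdbool.h>

/* Default every slot to false, then enable selected opcodes. The later
 * designators deliberately override the range initializer, which is what
 * -Woverride-init (enabled by W=1) reports for each overridden entry. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Woverride-init"
static const bool demo_insntable[256] = {
	[0 ... 255] = false,   /* GNU C range designator */
	[12] = true,           /* 12 is the index named in the warning above */
	[200] = true,          /* arbitrary second entry for illustration */
};
#pragma GCC diagnostic pop

static bool demo_insn_allowed(unsigned char opcode)
{
	return demo_insntable[opcode];
}
```

      The last initializer for a given index wins, so the table behaves as
      intended and the warning is noise for this deliberate pattern.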
      Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  4. 11 Jun, 2019 12 commits
  5. 06 Jun, 2019 2 commits
  6. 04 Jun, 2019 2 commits
  7. 01 Jun, 2019 4 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 0462eaac
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2019-05-31
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      Lots of exciting new features in the first PR of this development cycle!
      The main changes are:
      
      1) misc verifier improvements, from Alexei.
      
      2) bpftool can now convert btf to valid C, from Andrii.
      
      3) verifier can insert explicit ZEXT insn when requested by 32-bit JITs.
         This feature greatly improves BPF speed on 32-bit architectures. From Jiong.
      
      4) cgroups will now auto-detach bpf programs. This fixes the issue of thousands
         of bpf programs getting stuck in dying cgroups. From Roman.
      
      5) new bpf_send_signal() helper, from Yonghong.
      
      6) cgroup inet skb programs can signal CN to the stack, from Lawrence.
      
      7) miscellaneous cleanups, from many developers.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests/bpf: measure RTT from xdp using xdping · cd538502
      Alan Maguire authored
      xdping allows us to get latency estimates from XDP.  Output looks
      like this:
      
      ./xdping -I eth4 192.168.55.8
      Setting up XDP for eth4, please wait...
      XDP setup disrupts network connectivity, hit Ctrl+C to quit
      
      Normal ping RTT data
      [Ignore final RTT; it is distorted by XDP using the reply]
      PING 192.168.55.8 (192.168.55.8) from 192.168.55.7 eth4: 56(84) bytes of data.
      64 bytes from 192.168.55.8: icmp_seq=1 ttl=64 time=0.302 ms
      64 bytes from 192.168.55.8: icmp_seq=2 ttl=64 time=0.208 ms
      64 bytes from 192.168.55.8: icmp_seq=3 ttl=64 time=0.163 ms
      64 bytes from 192.168.55.8: icmp_seq=8 ttl=64 time=0.275 ms
      
      4 packets transmitted, 4 received, 0% packet loss, time 3079ms
      rtt min/avg/max/mdev = 0.163/0.237/0.302/0.054 ms
      
      XDP RTT data:
      64 bytes from 192.168.55.8: icmp_seq=5 ttl=64 time=0.02808 ms
      64 bytes from 192.168.55.8: icmp_seq=6 ttl=64 time=0.02804 ms
      64 bytes from 192.168.55.8: icmp_seq=7 ttl=64 time=0.02815 ms
      64 bytes from 192.168.55.8: icmp_seq=8 ttl=64 time=0.02805 ms
      
      The xdping program loads the associated xdping_kern.o BPF program
      and attaches it to the specified interface.  If run in client
      mode (the default), it will add a map entry keyed by the
      target IP address; this map will store RTT measurements, current
      sequence number etc.  Finally in client mode the ping command
      is executed, and the xdping BPF program will use the last ICMP
      reply, reformulate it as an ICMP request with the next sequence
      number and XDP_TX it.  After the reply to that request is received
      we can measure RTT and repeat until the desired number of
      measurements is made.  This is why the sequence numbers in the
      normal ping are 1, 2, 3 and 8.  We XDP_TX a modified version
      of ICMP reply 4 and keep doing this until we get the 4 replies
      we need; hence the networking stack only sees reply 8, where
      we have XDP_PASSed it upstream since we are done.
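
      The sequence-number arithmetic can be checked with a small model. This
      is a sketch under the assumption, taken from the sample output, that
      ping sends as many requests as xdping takes measurements; the function
      name is hypothetical.

```c
#include <stddef.h>

/* Fill `seen` with the ICMP sequence numbers that reach the normal ping.
 * Replies 1..count-1 pass through untouched. The reply to request `count`
 * is turned into request count+1 and sent back with XDP_TX, and so on for
 * `count` measurements, so the next reply passed up has seq 2 * count.
 * Returns the number of entries written. */
static size_t stack_visible_seqs(int count, int *seen, size_t cap)
{
	size_t n = 0;

	for (int seq = 1; seq < count && n < cap; seq++)
		seen[n++] = seq;
	if (count > 0 && n < cap)
		seen[n++] = 2 * count;
	return n;
}
```

      For a count of 4 the model yields 1, 2, 3 and 8, and the XDP-side
      measurements correspond to sequence numbers 5 through 8, as in the
      sample output.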
      
      In server mode (-s), xdping simply takes ICMP requests and replies
      to them in XDP rather than passing the request up to the networking
      stack.  No map entry is required.
      
      xdping can be run in native XDP mode (the default, or specified
      via -N) or in skb mode (-S).
      
      A test program test_xdping.sh exercises some of these options.
      
      Note that native XDP does not seem to XDP_TX for veths, hence -N
      is not tested.  Looking at the code, it looks like XDP_TX is
      supported so I'm not sure if that's expected.  Running xdping in
      native mode for ixgbe as both client and server works fine.
      
      Changes since v4
      
      - close fds on cleanup (Song Liu)
      
      Changes since v3
      
      - fixed seq to be __be16 (Song Liu)
      - fixed fd checks in xdping.c (Song Liu)
      
      Changes since v2
      
      - updated commit message to explain why seq number of last
        ICMP reply is 8 not 4 (Song Liu)
      - updated types of seq number, raddr and eliminated csum variable
        in xdpclient/xdpserver functions as it was not needed (Song Liu)
      - added XDPING_DEFAULT_COUNT definition and usage specification of
        default/max counts (Song Liu)
      
      Changes since v1
       - moved from RFC to PATCH
       - removed unused variable in ipv4_csum() (Song Liu)
       - refactored ICMP checks into icmp_check() function called by client
         and server programs and reworked client and server programs due
         to lack of shared code (Song Liu)
       - added checks to ensure that SKB and native mode are not requested
         together (Song Liu)
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 33aae282
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2019-05-31
      
      This series contains updates to the iavf driver.
      
      Nathan Chancellor converts the use of gnu_printf to printf.
      
      Aleksandr modifies the driver to limit the number of RSS queues to the
      number of online CPUs in order to avoid creating misconfigured RSS
      queues.
      
      Gustavo A. R. Silva converts a couple of instances where sizeof() can be
      replaced with struct_size().
      
      Alice makes the remaining changes to the iavf driver to cleanup all the
      old "i40evf" references in the driver to iavf, including the file names
      that still contained the old driver reference.  No functional
      changes were made, only cosmetic ones to reduce any confusion going forward now
      that the iavf driver is the virtual function driver for both i40e and
      ice drivers.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: doc: update answer for 32-bit subregister question · c231c22a
      Jiong Wang authored
      There has been quite a bit of progress around the two steps mentioned in the
      answer to the following question:
      
        Q: BPF 32-bit subregister requirements
      
      This patch updates the answer to reflect what has been done.
      
      v2:
       - Add missing full stop. (Song Liu)
       - Minor tweak on one sentence. (Song Liu)
      
      v1:
       - Integrated rephrase from Quentin and Jakub
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  8. 31 May, 2019 1 commit
    • Merge branch 'map-charge-cleanup' · d168286d
      Alexei Starovoitov authored
      Roman Gushchin says:
      
      ====================
      During my work on memcg-based memory accounting for bpf maps
      I've done some cleanups and refactorings of the existing
      memlock rlimit-based code. It makes it more robust, unifies
      size to pages conversion, size checks and corresponding error
      codes. Also it adds coverage for cgroup local storage and
      socket local storage maps.
      
      It looks like some preliminary work on the mm side might be
      required to start working on the memcg-based accounting,
      so I'm sending these patches as a separate patchset.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>