1. 20 Oct, 2018 4 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-msg-push-data' · 2576b967
      Daniel Borkmann authored
      John Fastabend says:
      
      ====================
      This series adds a new helper bpf_msg_push_data to be used by
      sk_msg programs. The helper can be used to insert extra bytes into
      the message that can then be used by the program as metadata tags
      among other things.
      
      The first patch adds the helper, second patch the libbpf support,
      and last patch updates test_sockmap to run msg_push_data tests.
      
      v2: rebase after queue map and in filter.c convert int -> u32
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      2576b967
    • John Fastabend's avatar
      bpf: test_sockmap add options to use msg_push_data · 84fbfe02
      John Fastabend authored
      Add options to run msg_push_data, this patch creates two more flags
      in test_sockmap that can be used to specify the offset and length
      of bytes to be added. The new options are --txmsg_start_push to
      specify where bytes should be inserted and --txmsg_end_push to
      specify how many bytes. This is analagous to the options that are
      used to pull data, --txmsg_start and --txmsg_end.
      
      In addition to adding the options tests are added to the test
      suit to run the tests similar to what was done for msg_pull_data.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      84fbfe02
    • John Fastabend's avatar
      bpf: libbpf support for msg_push_data · f908d26b
      John Fastabend authored
      Add support for new bpf_msg_push_data in libbpf.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      f908d26b
    • John Fastabend's avatar
      bpf: sk_msg program helper bpf_msg_push_data · 6fff607e
      John Fastabend authored
      This allows user to push data into a msg using sk_msg program types.
      The format is as follows,
      
      	bpf_msg_push_data(msg, offset, len, flags)
      
      this will insert 'len' bytes at offset 'offset'. For example to
      prepend 10 bytes at the front of the message the user can,
      
      	bpf_msg_push_data(msg, 0, 10, 0);
      
      This will invalidate data bounds so BPF user will have to then recheck
      data bounds after calling this. After this the msg size will have been
      updated and the user is free to write into the added bytes. We allow
      any offset/len as long as it is within the (data, data_end) range.
      However, a copy will be required if the ring is full and its possible
      for the helper to fail with ENOMEM or EINVAL errors which need to be
      handled by the BPF program.
      
      This can be used similar to XDP metadata to pass data between sk_msg
      layer and lower layers.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      6fff607e
  2. 19 Oct, 2018 17 commits
    • John Fastabend's avatar
      bpf: skmsg, fix psock create on existing kcm/tls port · 5032d079
      John Fastabend authored
      Before using the psock returned by sk_psock_get() when adding it to a
      sockmap we need to ensure it is actually a sockmap based psock.
      Previously we were only checking this after incrementing the reference
      counter which was an error. This resulted in a slab-out-of-bounds
      error when the psock was not actually a sockmap type.
      
      This moves the check up so the reference counter is only used
      if it is a sockmap psock.
      
      Eric reported the following KASAN BUG,
      
      BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
      BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
      Read of size 4 at addr ffff88019548be58 by task syz-executor4/22387
      
      CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
       print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
       check_memory_region_inline mm/kasan/kasan.c:260 [inline]
       check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
       kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
       atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
       refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
       sk_psock_get include/linux/skmsg.h:379 [inline]
       sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
       sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
       sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
       map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Reported-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      5032d079
    • Alexei Starovoitov's avatar
      bpf: remove unused variable · 540fefc0
      Alexei Starovoitov authored
      fix the following warning
      ../kernel/bpf/syscall.c: In function ‘map_lookup_and_delete_elem’:
      ../kernel/bpf/syscall.c:1010:22: warning: unused variable ‘ptr’ [-Wunused-variable]
        void *key, *value, *ptr;
                            ^~~
      
      Fixes: bd513cd0 ("bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall")
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      540fefc0
    • Alexei Starovoitov's avatar
      Merge branch 'cg_skb_direct_pkt_access' · d375e344
      Alexei Starovoitov authored
      Song Liu says:
      
      ====================
      Changes v7 -> v8:
      1. Dynamically allocate the dummy sk to avoid race conditions.
      
      Changes v6 -> v7:
      1. Make dummy sk a global variable (test_run_sk).
      
      Changes v5 -> v6:
      1. Fixed dummy sk in bpf_prog_test_run_skb() as suggested by Eric Dumazet.
      
      Changes v4 -> v5:
      1. Replaced bpf_compute_and_save_data_pointers() with
         bpf_compute_and_save_data_end();
         Replaced bpf_restore_data_pointers() with bpf_restore_data_end().
      2. Fixed indentation in test_verifier.c
      
      Changes v3 -> v4:
      1. Fixed crash issue reported by Alexei.
      
      Changes v2 -> v3:
      1. Added helper function bpf_compute_and_save_data_pointers() and
         bpf_restore_data_pointers().
      
      Changes v1 -> v2:
      1. Updated the list of read-only fields, and read-write fields.
      2. Added dummy sk to bpf_prog_test_run_skb().
      
      This set enables BPF program of type BPF_PROG_TYPE_CGROUP_SKB to access
      some __skb_buff data directly.
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d375e344
    • Song Liu's avatar
      bpf: add tests for direct packet access from CGROUP_SKB · 2cb494a3
      Song Liu authored
      Tests are added to make sure CGROUP_SKB cannot access:
        tc_classid, data_meta, flow_keys
      
      and can read and write:
        mark, prority, and cb[0-4]
      
      and can read other fields.
      
      To make selftest with skb->sk work, a dummy sk is added in
      bpf_prog_test_run_skb().
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      2cb494a3
    • Song Liu's avatar
      bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB · b39b5f41
      Song Liu authored
      BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the
      skb. This patch enables direct access of skb for these programs.
      
      Two helper functions bpf_compute_and_save_data_end() and
      bpf_restore_data_end() are introduced. There are used in
      __cgroup_bpf_run_filter_skb(), to compute proper data_end for the
      BPF program, and restore original data afterwards.
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      b39b5f41
    • Alexei Starovoitov's avatar
      Merge branch 'improve_perf_barriers' · 2929ad29
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      This set first adds smp_* barrier variants to tools infrastructure
      and updates perf and libbpf to make use of them. For details, please
      see individual patches, thanks!
      
      Arnaldo, if there are no objections, could this be routed via bpf-next
      with Acked-by's due to later dependencies in libbpf? Alternatively,
      I could also get the 2nd patch out during merge window, but perhaps
      it's okay to do in one go as there shouldn't be much conflict in perf
      itself.
      
      Thanks!
      
      v1 -> v2:
        - add common helper and switch to acquire/release variants
          when possible, thanks Peter!
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      2929ad29
    • Daniel Borkmann's avatar
      bpf, libbpf: use correct barriers in perf ring buffer walk · a64af0ef
      Daniel Borkmann authored
      Given libbpf is a generic library and not restricted to x86-64 only,
      the compiler barrier in bpf_perf_event_read_simple() after fetching
      the head needs to be replaced with smp_rmb() at minimum. Also, writing
      out the tail we should use WRITE_ONCE() to avoid store tearing.
      
      Now that we have the logic in place in ring_buffer_read_head() and
      ring_buffer_write_tail() helper also used by perf tool which would
      select the correct and best variant for a given architecture (e.g.
      x86-64 can avoid CPU barriers entirely), make use of these in order
      to fix bpf_perf_event_read_simple().
      
      Fixes: d0cabbb0 ("tools: bpf: move the event reading loop to libbpf")
      Fixes: 39111695 ("samples: bpf: add bpf_perf_event_output example")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a64af0ef
    • Daniel Borkmann's avatar
      tools, perf: add and use optimized ring_buffer_{read_head, write_tail} helpers · 09d62154
      Daniel Borkmann authored
      Currently, on x86-64, perf uses LFENCE and MFENCE (rmb() and mb(),
      respectively) when processing events from the perf ring buffer which
      is unnecessarily expensive as we can do more lightweight in particular
      given this is critical fast-path in perf.
      
      According to Peter rmb()/mb() were added back then via a94d342b
      ("tools/perf: Add required memory barriers") at a time where kernel
      still supported chips that needed it, but nowadays support for these
      has been ditched completely, therefore we can fix them up as well.
      
      While for x86-64, replacing rmb() and mb() with smp_*() variants would
      result in just a compiler barrier for the former and LOCK + ADD for
      the latter (__sync_synchronize() uses slower MFENCE by the way), Peter
      suggested we can use smp_{load_acquire,store_release}() instead for
      architectures where its implementation doesn't resolve in slower smp_mb().
      Thus, e.g. in x86-64 we would be able to avoid CPU barrier entirely due
      to TSO. For architectures where the latter needs to use smp_mb() e.g.
      on arm, we stick to cheaper smp_rmb() variant for fetching the head.
      
      This work adds helpers ring_buffer_read_head() and ring_buffer_write_tail()
      for tools infrastructure that either switches to smp_load_acquire() for
      architectures where it is cheaper or uses READ_ONCE() + smp_rmb() barrier
      for those where it's not in order to fetch the data_head from the perf
      control page, and it uses smp_store_release() to write the data_tail.
      Latter is smp_mb() + WRITE_ONCE() combination or a cheaper variant if
      architecture allows for it. Those that rely on smp_rmb() and smp_mb() can
      further improve performance in a follow up step by implementing the two
      under tools/arch/*/include/asm/barrier.h such that they don't have to
      fallback to rmb() and mb() in tools/include/asm/barrier.h.
      
      Switch perf to use ring_buffer_read_head() and ring_buffer_write_tail()
      so it can make use of the optimizations. Later, we convert libbpf as
      well to use the same helpers.
      
      Side note [0]: the topic has been raised of whether one could simply use
      the C11 gcc builtins [1] for the smp_load_acquire() and smp_store_release()
      instead:
      
        __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
        __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
      
      Kernel and (presumably) tooling shipped along with the kernel has a
      minimum requirement of being able to build with gcc-4.6 and the latter
      does not have C11 builtins. While generally the C11 memory models don't
      align with the kernel's, the C11 load-acquire and store-release alone
      /could/ suffice, however. Issue is that this is implementation dependent
      on how the load-acquire and store-release is done by the compiler and
      the mapping of supported compilers must align to be compatible with the
      kernel's implementation, and thus needs to be verified/tracked on a
      case by case basis whether they match (unless an architecture uses them
      also from kernel side). The implementations for smp_load_acquire() and
      smp_store_release() in this patch have been adapted from the kernel side
      ones to have a concrete and compatible mapping in place.
      
        [0] http://patchwork.ozlabs.org/patch/985422/
        [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.htmlSigned-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      09d62154
    • Anders Roxell's avatar
      selftests/bpf: add missing executables to .gitignore · 78de3546
      Anders Roxell authored
      Fixes: 371e4fcc ("selftests/bpf: cgroup local storage-based network counters")
      Fixes: 370920c4 ("selftests/bpf: Test libbpf_{prog,attach}_type_by_name")
      Signed-off-by: default avatarAnders Roxell <anders.roxell@linaro.org>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      78de3546
    • Alexei Starovoitov's avatar
      Merge branch 'queue_stack_maps' · 43ed375f
      Alexei Starovoitov authored
      Mauricio Vasquez says:
      
      ====================
      In some applications this is needed have a pool of free elements, for
      example the list of free L4 ports in a SNAT.  None of the current maps allow
      to do it as it is not possible to get any element without having they key
      it is associated to, even if it were possible, the lack of locking mecanishms in
      eBPF would do it almost impossible to be implemented without data races.
      
      This patchset implements two new kind of eBPF maps: queue and stack.
      Those maps provide to eBPF programs the peek, push and pop operations, and for
      userspace applications a new bpf_map_lookup_and_delete_elem() is added.
      Signed-off-by: default avatarMauricio Vasquez B <mauricio.vasquez@polito.it>
      
      v2 -> v3:
       - Remove "almost dead code" in syscall.c
       - Remove unnecessary copy_from_user in bpf_map_lookup_and_delete_elem
       - Rebase
      
      v1 -> v2:
       - Put ARG_PTR_TO_UNINIT_MAP_VALUE logic into a separated patch
       - Fix missing __this_cpu_dec & preempt_enable calls in kernel/bpf/syscall.c
      
      RFC v4 -> v1:
       - Remove roundup to power of 2 in memory allocation
       - Remove count and use a free slot to check if queue/stack is empty
       - Use if + assigment for wrapping indexes
       - Fix some minor style issues
       - Squash two patches together
      
      RFC v3 -> RFC v4:
       - Revert renaming of kernel/bpf/stackmap.c
       - Remove restriction on value size
       - Remove len arguments from peek/pop helpers
       - Add new ARG_PTR_TO_UNINIT_MAP_VALUE
      
      RFC v2 -> RFC v3:
       - Return elements by value instead that by reference
       - Implement queue/stack base on array and head + tail indexes
       - Rename stack trace related files to avoid confusion and conflicts
      
      RFC v1 -> RFC v2:
       - Create two separate maps instead of single one + flags
       - Implement bpf_map_lookup_and_delete syscall
       - Support peek operation
       - Define replacement policy through flags in the update() method
       - Add eBPF side tests
      ====================
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      43ed375f
    • Mauricio Vasquez B's avatar
      selftests/bpf: add test cases for queue and stack maps · 43b987d2
      Mauricio Vasquez B authored
      test_maps:
      Tests that queue/stack maps are behaving correctly even in corner cases
      
      test_progs:
      Tests new ebpf helpers
      Signed-off-by: default avatarMauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      43b987d2
    • Mauricio Vasquez B's avatar
      Sync uapi/bpf.h to tools/include · da4e1b15
      Mauricio Vasquez B authored
      Sync both files.
      Signed-off-by: default avatarMauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      da4e1b15
    • Mauricio Vasquez B's avatar
      bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall · bd513cd0
      Mauricio Vasquez B authored
      The previous patch implemented a bpf queue/stack maps that
      provided the peek/pop/push functions.  There is not a direct
      relationship between those functions and the current maps
      syscalls, hence a new MAP_LOOKUP_AND_DELETE_ELEM syscall is added,
      this is mapped to the pop operation in the queue/stack maps
      and it is still to implement in other kind of maps.
      Signed-off-by: default avatarMauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      bd513cd0
    • Mauricio Vasquez B's avatar
      bpf: add queue and stack maps · f1a2e44a
      Mauricio Vasquez B authored
      Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
      These maps support peek, pop and push operations that are exposed to eBPF
      programs through the new bpf_map[peek/pop/push] helpers.  Those operations
      are exposed to userspace applications through the already existing
      syscalls in the following way:
      
      BPF_MAP_LOOKUP_ELEM            -> peek
      BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
      BPF_MAP_UPDATE_ELEM            -> push
      
      Queue/stack maps are implemented using a buffer, tail and head indexes,
      hence BPF_F_NO_PREALLOC is not supported.
      
      As opposite to other maps, queue and stack do not use RCU for protecting
      maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
      argument that is a pointer to a memory zone where to save the value of a
      map.  Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not
      be passed as an extra argument.
      
      Our main motivation for implementing queue/stack maps was to keep track
      of a pool of elements, like network ports in a SNAT, however we forsee
      other use cases, like for exampling saving last N kernel events in a map
      and then analysing from userspace.
      Signed-off-by: default avatarMauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f1a2e44a
    • Mauricio Vasquez B's avatar
      bpf/verifier: add ARG_PTR_TO_UNINIT_MAP_VALUE · 2ea864c5
      Mauricio Vasquez B authored
      ARG_PTR_TO_UNINIT_MAP_VALUE argument is a pointer to a memory zone
      used to save the value of a map.  Basically the same as
      ARG_PTR_TO_UNINIT_MEM, but the size has not be passed as an extra
      argument.
      
      This will be used in the following patch that implements some new
      helpers that receive a pointer to be filled with a map value.
      Signed-off-by: default avatarMauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      2ea864c5
    • Mauricio Vasquez B's avatar
      bpf/syscall: allow key to be null in map functions · c9d29f46
      Mauricio Vasquez B authored
      This commit adds the required logic to allow key being NULL
      in case the key_size of the map is 0.
      
      A new __bpf_copy_key function helper only copies the key from
      userpsace when key_size != 0, otherwise it enforces that key must be
      null.
      Signed-off-by: default avatarMauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c9d29f46
    • Mauricio Vasquez B's avatar
      bpf: rename stack trace map operations · 14499160
      Mauricio Vasquez B authored
      In the following patches queue and stack maps (FIFO and LIFO
      datastructures) will be implemented.  In order to avoid confusion and
      a possible name clash rename stack_map_ops to stack_trace_map_ops
      Signed-off-by: default avatarMauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      14499160
  3. 18 Oct, 2018 3 commits
  4. 17 Oct, 2018 5 commits
  5. 16 Oct, 2018 11 commits
    • Alexei Starovoitov's avatar
      Merge branch 'nfp-improve-bpf-offload' · 9032c10e
      Alexei Starovoitov authored
      Jakub Kicinski says:
      
      ====================
      this set adds check to make sure offload behaviour is correct.
      First when atomic counters are used, we must make sure the map
      does not already contain data we did not prepare for holding
      atomics.
      
      Second patch double checks vNIC capabilities for program offload
      in case program is shared by multiple vNICs with different
      constraints.
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      9032c10e
    • Jakub Kicinski's avatar
      nfp: bpf: double check vNIC capabilities after object sharing · 44b6fed0
      Jakub Kicinski authored
      Program translation stage checks that program can be offloaded to
      the netdev which was passed during the load (bpf_attr->prog_ifindex).
      After program sharing was introduced, however, the netdev on which
      program is loaded can theoretically be different, and therefore
      we should recheck the program size and max stack size at load time.
      
      This was found by code inspection, AFAIK today all vNICs have
      identical caps.
      Signed-off-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: default avatarQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      44b6fed0
    • Jakub Kicinski's avatar
      nfp: bpf: protect against mis-initializing atomic counters · 527db74b
      Jakub Kicinski authored
      Atomic operations on the NFP are currently always in big endian.
      The driver keeps track of regions of memory storing atomic values
      and byte swaps them accordingly.  There are corner cases where
      the map values may be initialized before the driver knows they
      are used as atomic counters.  This can happen either when the
      datapath is performing the update and the stack contents are
      unknown or when map is updated before the program which will
      use it for atomic values is loaded.
      
      To avoid situation where user initializes the value to 0 1 2 3
      and then after loading a program which uses the word as an atomic
      counter starts reading 3 2 1 0 - only allow atomic counters to be
      initialized to endian-neutral values.
      
      For updates from the datapath the stack information may not be
      as precise, so just allow initializing such values to 0.
      
      Example code which would break:
      struct bpf_map_def SEC("maps") rxcnt = {
             .type = BPF_MAP_TYPE_HASH,
             .key_size = sizeof(__u32),
             .value_size = sizeof(__u64),
             .max_entries = 1,
      };
      
      int xdp_prog1()
      {
            	__u64 nonzeroval = 3;
      	__u32 key = 0;
      	__u64 *value;
      
      	value = bpf_map_lookup_elem(&rxcnt, &key);
      	if (!value)
      		bpf_map_update_elem(&rxcnt, &key, &nonzeroval, BPF_ANY);
      	else
      		__sync_fetch_and_add(value, 1);
      
      	return XDP_PASS;
      }
      
      $ offload bpftool map dump
      key: 00 00 00 00 value: 00 00 00 03 00 00 00 00
      
      should be:
      
      $ offload bpftool map dump
      key: 00 00 00 00 value: 03 00 00 00 00 00 00 00
      Reported-by: default avatarDavid Beckett <david.beckett@netronome.com>
      Signed-off-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: default avatarQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      527db74b
    • Andrey Ignatov's avatar
      libbpf: Per-symbol visibility for DSO · ab9e0848
      Andrey Ignatov authored
      Make global symbols in libbpf DSO hidden by default with
      -fvisibility=hidden and export symbols that are part of ABI explicitly
      with __attribute__((visibility("default"))).
      
      This is common practice that should prevent from accidentally exporting
      a symbol, that is not supposed to be a part of ABI what, in turn,
      improves both libbpf developer- and user-experiences. See [1] for more
      details.
      
      Export control becomes more important since more and more projects use
      libbpf.
      
      The patch doesn't export a bunch of netlink related functions since as
      agreed in [2] they'll be reworked. That doesn't break bpftool since
      bpftool links libbpf statically.
      
      [1] https://www.akkadia.org/drepper/dsohowto.pdf (2.2 Export Control)
      [2] https://www.mail-archive.com/netdev@vger.kernel.org/msg251434.htmlSigned-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ab9e0848
    • Daniel Borkmann's avatar
      bpf, tls: add tls header to tools infrastructure · 421f4292
      Daniel Borkmann authored
      Andrey reported a build error for the BPF kselftest suite when compiled on
      a machine which does not have tls related header bits installed natively:
      
        test_sockmap.c:120:23: fatal error: linux/tls.h: No such file or directory
         #include <linux/tls.h>
                               ^
        compilation terminated.
      
      Fix it by adding the header to the tools include infrastructure and add
      definitions such as SOL_TLS that could potentially be missing.
      
      Fixes: e9dd9047 ("bpf: add tls support for testing in test_sockmap")
      Reported-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      421f4292
    • David S. Miller's avatar
      Merge branch 'net-Kernel-side-filtering-for-route-dumps' · 2c59f06c
      David S. Miller authored
      David Ahern says:
      
      ====================
      net: Kernel side filtering for route dumps
      
      Implement kernel side filtering of route dumps by protocol (e.g., which
      routing daemon installed the route), route type (e.g., unicast), table
      id and nexthop device.
      
      iproute2 has been doing this filtering in userspace for years; pushing
      the filters to the kernel side reduces the amount of data the kernel
      sends and reduces wasted cycles on both sides processing unwanted data.
      These initial options provide a huge improvement for efficiently
      examining routes on large scale systems.
      
      v2
      - better handling of requests for a specific table. Rather than walking
        the hash of all tables, lookup the specific table and dump it
      - refactor mr_rtm_dumproute moving the loop over the table into a
        helper that can be invoked directly
      - add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure
        it is returned even when the dump returns nothing
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2c59f06c
    • David Ahern's avatar
      net/ipv4: Bail early if user only wants prefix entries · e4e92fb1
      David Ahern authored
      Unlike IPv6, IPv4 does not have routes marked with RTF_PREFIX_RT. If the
      flag is set in the dump request, just return.
      
      In the process of this change, move the CLONE check to use the new
      filter flags.
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e4e92fb1
    • David Ahern's avatar
      net/ipv6: Bail early if user only wants cloned entries · 08e814c9
      David Ahern authored
      Similar to IPv4, IPv6 fib no longer contains cloned routes. If a user
      requests a route dump for only cloned entries, no sense walking the FIB
      and returning everything.
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      08e814c9
    • David Ahern's avatar
      net/mpls: Handle kernel side filtering of route dumps · 196cfebf
      David Ahern authored
      Update the dump request parsing in MPLS for the non-INET case to
      enable kernel side filtering. If INET is disabled the only filters
      that make sense for MPLS are protocol and nexthop device.
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      196cfebf
    • David Ahern's avatar
      net: Enable kernel side filtering of route dumps · effe6792
      David Ahern authored
      Update parsing of route dump request to enable kernel side filtering.
      Allow filtering results by protocol (e.g., which routing daemon installed
      the route), route type (e.g., unicast), table id and nexthop device. These
      amount to the low hanging fruit, yet a huge improvement, for dumping
      routes.
      
      ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can
      be used to look up the device index without taking a reference. From
      there filter->dev is only used during dump loops with the lock still held.
      
      Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results
      have been filtered should no entries be returned.
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      effe6792
    • David Ahern's avatar
      net: Plumb support for filtering ipv4 and ipv6 multicast route dumps · cb167893
      David Ahern authored
      Implement kernel side filtering of routes by egress device index and
      table id. If the table id is given in the filter, lookup table and
      call mr_table_dump directly for it.
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cb167893