1. 18 Apr, 2019 4 commits
  2. 17 Apr, 2019 13 commits
    • Merge branch 'af_xdp-smp_mb-fixes' · 00967e84
      Alexei Starovoitov authored
      Magnus Karlsson says:
      
      ====================
      This patch set fixes one bug and removes two dependencies on Linux
      kernel headers from the XDP socket code in libbpf. A number of people
      have pointed out that these two dependencies make it hard to build the
      XDP socket part of libbpf without any kernel header dependencies. The
      two removed dependencies are:
      
      * Remove the usage of likely and unlikely (compiler.h) in xsk.h. It
        has been reported that the use of these actually decreases the
        performance of the ring access code due to an increase in
        instruction cache misses, so let us just remove these.
      
      * Remove the dependency on barrier.h as it brings in a lot of kernel
        headers. As the XDP socket code only uses two simple functions from
        it, we can reimplement these. As a bonus, the new implementation is
        faster as it uses the same barrier primitives as the kernel does
        when the same code is compiled there. Without this patch, the user
        land code uses lfence and sfence on x86, which are unnecessarily
        harsh/thorough.
      
      In the process of removing these dependencies a missing barrier
      function for at least PPC64 was discovered. For a full explanation on
      the missing barrier, please refer to patch 1. So the patch set now
      starts with two patches fixing this. I have also added a patch at the
      end removing this full memory barrier for x86 only, as it is not
      needed there.
      
      Structure of the patch set:
      Patch 1-2: Adds the missing barrier function in kernel and user space.
      Patch 3-4: Removes the dependencies
      Patch 5: Optimizes the added barrier from patch 2 so that it does not
               do unnecessary work on x86.
      
      v2 -> v3:
      * Added missing memory barrier in ring code
      * Added an explanation on the three barriers we use in the code
      * Moved barrier functions from xsk.h to libbpf_util.h
      * Added comment on why we have these functions in libbpf_util.h
      * Added a new barrier function in user space that makes it possible to
        remove the full memory barrier on x86.
      
      v1 -> v2:
      * Added comment about validity of ARM 32-bit barriers.
        Only armv7 and above.
      
      /Magnus
      ====================
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      00967e84
    • libbpf: optimize barrier for XDP socket rings · 2c5935f1
      Magnus Karlsson authored
      The full memory barrier in the XDP socket rings on the consumer side
      between the load of the data and the store of the consumer ring is
      there to protect the store from being executed before the load of the
      data. If this was allowed to happen, the producer might overwrite the
      data field with a new entry before the consumer got the chance to read
      it.
      
      On x86, stores are guaranteed not to be reordered with older loads, so
      it does not need a full memory barrier here. A compile-time barrier
      would be enough. This patch introduces a new primitive in
      libbpf_util.h that implements a new barrier type (libbpf_smp_rwmb)
      preventing stores from being reordered with older loads. It is then
      used in the XDP socket ring access code in libbpf to improve performance.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      2c5935f1
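
A minimal sketch of the consumer-side ordering this commit message describes, assuming a single-producer/single-consumer ring. The names `sketch_smp_rmb`, `sketch_smp_rwmb`, `struct sketch_ring` and `sketch_ring_consume` are hypothetical and are not taken from libbpf_util.h or xsk.h. The non-x86 branch uses the GCC/Clang builtin `__sync_synchronize()` as a stand-in full barrier, which is an assumption rather than what libbpf does per architecture.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
/* x86 keeps loads ordered with older loads and stores ordered with older
 * loads, so only the compiler has to be restrained. */
#define sketch_smp_rmb()  __asm__ __volatile__("" : : : "memory")
#define sketch_smp_rwmb() __asm__ __volatile__("" : : : "memory")
#else
/* Assumption: fall back to a full barrier on every other architecture. */
#define sketch_smp_rmb()  __sync_synchronize()
#define sketch_smp_rwmb() __sync_synchronize()
#endif

#define SKETCH_RING_SIZE 8u
static_assert((SKETCH_RING_SIZE & (SKETCH_RING_SIZE - 1)) == 0,
	      "ring size must be a power of two");

struct sketch_ring {
	volatile uint32_t producer;	/* written by the producer only */
	volatile uint32_t consumer;	/* written by the consumer only */
	uint64_t desc[SKETCH_RING_SIZE];
};

/* Read one entry, then hand its slot back to the producer.
 * Returns 1 if an entry was read, 0 if the ring was empty. */
static int sketch_ring_consume(struct sketch_ring *r, uint64_t *out)
{
	uint32_t cons = r->consumer;

	if (cons == r->producer)
		return 0;

	/* Do not load the data before the producer index has been loaded. */
	sketch_smp_rmb();
	*out = r->desc[cons & (SKETCH_RING_SIZE - 1)];

	/* The data load above must complete before the store below tells
	 * the producer that the slot may be overwritten. */
	sketch_smp_rwmb();
	r->consumer = cons + 1;
	return 1;
}
```

On x86 both macros compile to no instructions at all, which is the saving the commit message refers to. Everywhere else this sketch still pays for a full barrier.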
    • libbpf: remove dependency on barrier.h in xsk.h · b7e3a280
      Magnus Karlsson authored
      The use of smp_rmb() and smp_wmb() creates a Linux header dependency
      on barrier.h that is unnecessary in most parts. This patch implements
      the two small defines that are needed from barrier.h. As a bonus, the
      new implementations are faster than the default ones: the defaults
      use sfence and lfence on x86, while we only need a compiler barrier in
      our case, just as when the same ring access code is compiled in
      the kernel.
      
      Fixes: 1cad0788 ("libbpf: add support for using AF_XDP sockets")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b7e3a280
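
A rough sketch of what two locally defined read/write barriers could look like, and how a producer and a consumer would pair them. All names (`sketch_smp_rmb`, `sketch_smp_wmb`, `struct sketch_fifo`, `sketch_fifo_produce`, `sketch_fifo_peek`) are hypothetical. The x86 branch follows the commit message (compiler barrier only); the fallback to `__sync_synchronize()` on other architectures is an assumption made to keep the example portable.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
/* x86 does not reorder loads with other loads or stores with other
 * stores, so a compiler barrier is enough. */
#define sketch_smp_rmb() __asm__ __volatile__("" : : : "memory")
#define sketch_smp_wmb() __asm__ __volatile__("" : : : "memory")
#else
/* Assumption: use a full barrier where no cheaper primitive is known. */
#define sketch_smp_rmb() __sync_synchronize()
#define sketch_smp_wmb() __sync_synchronize()
#endif

#define SKETCH_FIFO_SIZE 4u
static_assert((SKETCH_FIFO_SIZE & (SKETCH_FIFO_SIZE - 1)) == 0,
	      "fifo size must be a power of two");

struct sketch_fifo {
	volatile uint32_t producer;
	volatile uint32_t consumer;
	uint64_t desc[SKETCH_FIFO_SIZE];
};

/* Producer: write the entry first, then publish the new index.
 * Returns 1 on success, 0 if the fifo is full. */
static int sketch_fifo_produce(struct sketch_fifo *q, uint64_t val)
{
	uint32_t prod = q->producer;

	if (prod - q->consumer == SKETCH_FIFO_SIZE)
		return 0;

	q->desc[prod & (SKETCH_FIFO_SIZE - 1)] = val;
	sketch_smp_wmb();	/* entry must be visible before the index */
	q->producer = prod + 1;
	return 1;
}

/* Consumer: read the index first, then the entry it guards.
 * Returns 1 if an entry is available, 0 if the fifo is empty. */
static int sketch_fifo_peek(const struct sketch_fifo *q, uint64_t *out)
{
	uint32_t cons = q->consumer;

	if (cons == q->producer)
		return 0;

	sketch_smp_rmb();	/* index load must precede the entry load */
	*out = q->desc[cons & (SKETCH_FIFO_SIZE - 1)];
	return 1;
}
```

The write barrier in the producer pairs with the read barrier in the consumer. Neither needs sfence or lfence on x86 for ordinary cacheable memory.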
    • libbpf: remove likely/unlikely in xsk.h · a06d7296
      Magnus Karlsson authored
      This patch removes the use of likely and unlikely in xsk.h since they
      create a dependency on Linux headers as reported by several
      users. There have also been reports that the use of these decreases
      performance as the compiler puts the code on two different cache lines
      instead of on a single one. All in all, I think we are better off
      without them.
      
      Fixes: 1cad0788 ("libbpf: add support for using AF_XDP sockets")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a06d7296
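
For readers unfamiliar with the hints being removed, here is a small sketch of what such macros typically expand to with GCC or Clang, next to the same check written without them. The macro and function names are hypothetical, and the function is an illustrative stand-in rather than code from xsk.h.

```c
#include <assert.h>
#include <stdint.h>

/* Typical shape of branch-prediction hints built on a compiler builtin. */
#define sketch_likely(x)   __builtin_expect(!!(x), 1)
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* With a hint: the compiler may move the "refresh" path out of line. */
static uint32_t sketch_free_hinted(uint32_t cached, uint32_t fresh,
				   uint32_t wanted)
{
	assert(wanted > 0);
	if (sketch_likely(cached >= wanted))
		return cached;
	return fresh;
}

/* Without a hint: same result, code layout is left to the compiler. */
static uint32_t sketch_free_plain(uint32_t cached, uint32_t fresh,
				  uint32_t wanted)
{
	assert(wanted > 0);
	if (cached >= wanted)
		return cached;
	return fresh;
}
```

The hints never change what the function returns, only where the compiler places each branch. That is why dropping them is safe from a correctness point of view.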
    • libbpf: fix XDP socket ring buffer memory ordering · d5e63fdd
      Magnus Karlsson authored
      The ring buffer code of XDP sockets is missing a memory barrier on the
      consumer side between the load of the data and the write that signals
      that it is ok for the producer to put new data into the buffer. On
      architectures that do not guarantee that stores are not reordered
      with older loads, the producer might put data into the ring before the
      consumer had the chance to read it. As IA does guarantee this
      ordering, it would only need a compiler barrier here, but there are no
      primitives in barrier.h for this specific case (preventing writes from
      being ordered before older reads), so I had to add a smp_mb() here, which
      will translate into a run-time synch operation on IA.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d5e63fdd
    • xsk: fix XDP socket ring buffer memory ordering · f63666de
      Magnus Karlsson authored
      The ring buffer code of XDP sockets is missing a memory barrier on the
      consumer side between the load of the data and the write that signals
      that it is ok for the producer to put new data into the buffer. On
      architectures that do not guarantee that stores are not reordered
      with older loads, the producer might put data into the ring before the
      consumer had the chance to read it. As IA does guarantee this
      ordering, it would only need a compiler barrier here, but there are no
      primitives in Linux for this specific case (preventing writes from
      being ordered before older reads), so I had to add a smp_mb() here, which
      will translate into a run-time synch operation on IA.
      
      Added a longish comment in the code explaining what each barrier in
      the ring implementation accomplishes and what would happen if we
      removed one of them.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f63666de
    • tools/bpftool: show btf_id in map listing · d1b7725d
      Prashant Bhole authored
      Let's print btf id of map similar to the way we are printing it
      for programs.
      
      Sample output:
      user@test# bpftool map -f
      61: lpm_trie  flags 0x1
      	key 20B  value 8B  max_entries 1  memlock 4096B
      133: array  name test_btf_id  flags 0x0
      	key 4B  value 4B  max_entries 4  memlock 4096B
      	pinned /sys/fs/bpf/test100
      	btf_id 174
      170: array  name test_btf_id  flags 0x0
      	key 4B  value 4B  max_entries 4  memlock 4096B
      	btf_id 240
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d1b7725d
    • tools/bpftool: re-organize newline printing for map listing · d459b59e
      Prashant Bhole authored
      Let's move the final newline printing in show_map_close_plain() to
      the end of the function because it looks correct and consistent with
      prog.c. Also let's make the related changes to the line which prints
      the pinned file name.
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d459b59e
    • bpftool: Support sysctl hook · f25377ee
      Andrey Ignatov authored
      Add support for recently added BPF_PROG_TYPE_CGROUP_SYSCTL program type
      and BPF_CGROUP_SYSCTL attach type.
      
      Example of bpftool output with sysctl program from selftests:
      
        # bpftool p load ./test_sysctl_prog.o /mnt/bpf/sysctl_prog type cgroup/sysctl
        # bpftool p l
        9: cgroup_sysctl  name sysctl_tcp_mem  tag 0dd05f81a8d0d52e  gpl
                loaded_at 2019-04-16T12:57:27-0700  uid 0
                xlated 1008B  jited 623B  memlock 4096B
        # bpftool c a /mnt/cgroup2/bla sysctl id 9
        # bpftool c t
        CgroupPath
        ID       AttachType      AttachFlags     Name
        /mnt/cgroup2/bla
            9        sysctl                          sysctl_tcp_mem
        # bpftool c d /mnt/cgroup2/bla sysctl id 9
        # bpftool c t
        CgroupPath
        ID       AttachType      AttachFlags     Name
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f25377ee
    • libbpf: fix printf formatter for ptrdiff_t argument · e1d1dc46
      Andrii Nakryiko authored
      Using %ld for printing out a value of ptrdiff_t type is not portable
      between 32-bit and 64-bit archs. This is causing compilation errors for
      libbpf on 32-bit platforms (discovered as part of an effort to integrate
      libbpf into systemd ([0])). The proper formatter is %td, which is used in
      this patch.
      
      v1->v2:
        - add Reported-by
        - provide more context on how this issue was discovered
      
      [0] https://github.com/systemd/systemd/pull/12151
      Reported-by: Evgeny Vereshchagin <evvers@ya.ru>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e1d1dc46
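
A short illustration of the portable formatter, assuming nothing about the libbpf call site that was changed. The helper name `sketch_format_offset` and the message text are hypothetical. `%td` is the standard C99 conversion for `ptrdiff_t`, so the same format string is correct on both 32-bit and 64-bit targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format the distance between two pointers into buf.
 * Returns the number of characters snprintf() would have written. */
static int sketch_format_offset(char *buf, size_t len,
				const char *base, const char *cur)
{
	ptrdiff_t off = cur - base;

	assert(buf != NULL && len > 0);
	/* %ld would only match ptrdiff_t where it happens to be long. */
	return snprintf(buf, len, "offset %td", off);
}
```

With `-Wformat` enabled, passing a `ptrdiff_t` to `%ld` is flagged on targets where `ptrdiff_t` is `int`. Builds that treat warnings as errors then fail, which matches the breakage described above.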
    • bpf: use BPF_CAST_CALL for casting bpf call · 0d306c31
      Prashant Bhole authored
      verifier.c uses BPF_CAST_CALL for casting bpf call except at one
      place in jit_subprogs(). Let's use the macro for consistency.
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0d306c31
    • bpf: allow clearing all sock_ops callback flags · 725721a6
      Viet Hoang Tran authored
      The helper function bpf_sock_ops_cb_flags_set() can be used to both
      set and clear the sock_ops callback flags. However, its current
      behavior is not consistent. A BPF program may clear a flag if more than
      one were set, or replace a flag with another one, but cannot clear all
      flags.
      
      This patch also updates the documentation to clarify that this helper
      function can clear flags.
      Signed-off-by: Hoang Tran <hoang.tran@uclouvain.be>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      725721a6
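
A user-space model of the inconsistency described in this commit message. It is inferred from the text and is not the kernel code. The structure, the mask value `SKETCH_ALL_CB_FLAGS` and both function names are hypothetical. Returning the unsupported bits is also an assumption about the helper's return convention.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define SKETCH_ALL_CB_FLAGS 0x7u	/* hypothetical mask of valid flags */

struct sketch_sock {
	uint32_t cb_flags;
};

/* Old behavior as described: a zero value is silently ignored, so flags
 * can be replaced or reduced but never cleared completely. */
static uint32_t sketch_cb_flags_set_old(struct sketch_sock *sk,
					uint32_t argval)
{
	uint32_t val = argval & SKETCH_ALL_CB_FLAGS;

	assert(sk != NULL);
	if (val)
		sk->cb_flags = val;
	return argval & ~SKETCH_ALL_CB_FLAGS;
}

/* New behavior: the masked value is always stored, including zero. */
static uint32_t sketch_cb_flags_set_new(struct sketch_sock *sk,
					uint32_t argval)
{
	assert(sk != NULL);
	sk->cb_flags = argval & SKETCH_ALL_CB_FLAGS;
	return argval & ~SKETCH_ALL_CB_FLAGS;
}
```

In this model, calling the old variant with 0 after setting 0x3 leaves 0x3 in place, while the new variant stores 0.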
    • selftests: bpf: add VRF test cases to lwt_ip_encap test. · 809041e7
      Peter Oskolkov authored
      This patch adds tests validating that VRF and BPF-LWT
      encap work together well, as requested by David Ahern.
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      809041e7
  3. 16 Apr, 2019 16 commits
  4. 13 Apr, 2019 4 commits
  5. 12 Apr, 2019 3 commits
    • bpf: Fix distinct pointer types warning for ARCH=i386 · 51356ac8
      Andrey Ignatov authored
      Fix a new warning reported by kbuild for make ARCH=i386:
      
         In file included from kernel/bpf/cgroup.c:11:0:
         kernel/bpf/cgroup.c: In function '__cgroup_bpf_run_filter_sysctl':
         include/linux/kernel.h:827:29: warning: comparison of distinct pointer types lacks a cast
            (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                                      ^
         include/linux/kernel.h:841:4: note: in expansion of macro '__typecheck'
            (__typecheck(x, y) && __no_side_effects(x, y))
             ^~~~~~~~~~~
         include/linux/kernel.h:851:24: note: in expansion of macro '__safe_cmp'
           __builtin_choose_expr(__safe_cmp(x, y), \
                                 ^~~~~~~~~~
         include/linux/kernel.h:860:19: note: in expansion of macro '__careful_cmp'
          #define min(x, y) __careful_cmp(x, y, <)
                            ^~~~~~~~~~~~~
      >> kernel/bpf/cgroup.c:837:17: note: in expansion of macro 'min'
            ctx.new_len = min(PAGE_SIZE, *pcount);
                          ^~~
      
      Fixes: 4e63acdf ("bpf: Introduce bpf_sysctl_{get,set}_new_value helpers")
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      51356ac8
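
The warning above comes from a type-checking `min()` being given operands of different types: on i386, `size_t` is `unsigned int` while `PAGE_SIZE` is `unsigned long`. The message does not show the change that was made. The sketch below only illustrates one conventional way to avoid such a mismatch, a `min` variant that casts both operands to a named type. All macro and function names are hypothetical simplifications, and the statement expressions and `__typeof__` are GCC/Clang extensions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified type-checked min: comparing the two addresses makes the
 * compiler warn when x and y have different types. */
#define sketch_min(x, y) __extension__ ({		\
	__typeof__(x) _a = (x);				\
	__typeof__(y) _b = (y);				\
	(void)(&_a == &_b);				\
	_a < _b ? _a : _b; })

/* Variant that converts both operands to an explicit type first. */
#define sketch_min_t(type, x, y) __extension__ ({	\
	type _a = (x);					\
	type _b = (y);					\
	_a < _b ? _a : _b; })

#define SKETCH_PAGE_SIZE 4096UL	/* unsigned long, as in the warning */

/* Same types on both sides: the checked macro is fine here. */
static size_t sketch_min_same(size_t a, size_t b)
{
	return sketch_min(a, b);
}

/* Mixed types: name the comparison type explicitly. */
static size_t sketch_clamp_len(size_t count)
{
	size_t len = sketch_min_t(size_t, SKETCH_PAGE_SIZE, count);

	assert(len <= SKETCH_PAGE_SIZE);
	return len;
}
```

On 64-bit Linux targets `size_t` and `unsigned long` are the same type, so the mismatch shows up only in 32-bit builds such as ARCH=i386.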
    • Merge branch 'bpf-sysctl-hook' · a43d0508
      Alexei Starovoitov authored
      Andrey Ignatov says:
      
      ====================
      v2->v3:
      - simplify C based selftests by relying on variable offset stack access.
      
      v1->v2:
      - add fs/proc/proc_sysctl.c maintainers to Cc:.
      
      The patch set introduces a new BPF hook for sysctl.
      
      It adds a new program type, BPF_PROG_TYPE_CGROUP_SYSCTL, and a new
      attach type, BPF_CGROUP_SYSCTL.
      
      The BPF_CGROUP_SYSCTL hook is placed before the call to sysctl's
      proc_handler so that accesses (read/write) to sysctl can be controlled
      for a specific cgroup and either allowed or denied, or traced.
      
      The hook has access to the sysctl name, the current sysctl value and (on
      write only) the new sysctl value via corresponding helpers. The new
      sysctl value can be overridden by the program. Both name and values
      (current/new) are represented as strings, the same way they are visible
      in /proc/sys/. It is up to the program to parse these strings.
      
      To help with parsing the most common kind of sysctl value, a vector of
      integers, two new helpers are provided: bpf_strtol and bpf_strtoul, with
      semantics similar to user space strtol(3) and strtoul(3).
      
      The hook also provides bpf_sysctl context with two fields:
      * @write indicates whether sysctl is being read (= 0) or written (= 1);
      * @file_pos is sysctl file position to read from or write to, can be
        overridden.
      
      The hook allows better isolation for containerized applications
      that are run as root, so that one container can't change a sysctl and
      affect all other containers on a host. It also makes changes to allowed
      sysctls safer and simplifies sysctl tracing for cgroups.
      
      Patch 1 is preliminary refactoring.
      Patch 2 adds new program and attach types.
      Patches 3-5 implement helpers to access sysctl name and value.
      Patch 6 adds file_pos field to bpf_sysctl context.
      Patch 7 updates UAPI in tools.
      Patches 8-9 add support for the new hook to libbpf and corresponding test.
      Patches 10-14 add selftests for the new hook.
      Patch 15 adds support for new arg types to verifier: pointer to integer.
      Patch 16 adds bpf_strto{l,ul} helpers to parse integers from sysctl value.
      Patch 17 updates UAPI in tools.
      Patch 18 updates bpf_helpers.h.
      Patch 19 adds selftests for pointer to integer in verifier.
      Patches 20-21 add selftests for bpf_strto{l,ul}, including integration
                    C based test for sysctl value parsing.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a43d0508
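
To make the allow/deny decision concrete, here is a user-space model of the kind of policy such a hook could implement. It is not a BPF program. `struct sketch_sysctl_ctx` is a hypothetical stand-in for the hook context with the two fields listed in the cover letter, and the name is passed in directly, whereas a real program would fetch it through the hook's helpers. The convention that 1 allows and 0 denies is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the context handed to the hook. */
struct sketch_sysctl_ctx {
	uint32_t write;		/* 0 = sysctl is read, 1 = sysctl is written */
	uint32_t file_pos;	/* position in the sysctl file */
};

/* Example policy: reads are always allowed, writes only to one sysctl.
 * Returns 1 to allow the access and 0 to deny it. */
static int sketch_sysctl_policy(const struct sketch_sysctl_ctx *ctx,
				const char *name)
{
	assert(ctx != NULL && name != NULL);

	if (!ctx->write)
		return 1;

	return strcmp(name, "net/ipv4/tcp_mem") == 0;
}
```

Attached to a container's cgroup, a policy of this shape would let a root process inside the container read any sysctl while restricting which ones it can change.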
    • selftests/bpf: C based test for sysctl and strtoX · 7568f4cb
      Andrey Ignatov authored
      Add C based test for a few bpf_sysctl_* helpers and bpf_strtoul.
      
      Make sure that sysctl can be identified by name and that multiple
      integers can be parsed from sysctl value with bpf_strtoul.
      
      net/ipv4/tcp_mem is chosen as the test sysctl. It contains 3 unsigned
      longs, all of which are parsed and compared (val[0] < val[1] < val[2]).
      
      Example of output:
        # ./test_sysctl
        ...
        Test case: C prog: deny all writes .. [PASS]
        Test case: C prog: deny access by name .. [PASS]
        Test case: C prog: read tcp_mem .. [PASS]
        Summary: 39 PASSED, 0 FAILED
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7568f4cb
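
The parsing step this test exercises can be approximated in user space with strtoul(3), which the cover letter names as the model for bpf_strtoul. The sketch below is plain C, not the selftest itself. The function names and the sample value string are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse up to max whitespace-separated unsigned longs from s.
 * Returns the number of values parsed, or -1 on overflow. */
static int sketch_parse_ulongs(const char *s, unsigned long *val, int max)
{
	int n = 0;

	assert(s != NULL && val != NULL);
	while (n < max) {
		char *end;
		unsigned long v;

		errno = 0;
		v = strtoul(s, &end, 10);	/* skips leading whitespace */
		if (end == s)
			break;			/* no more digits */
		if (errno == ERANGE)
			return -1;
		val[n++] = v;
		s = end;			/* continue after this number */
	}
	return n;
}

/* Check a tcp_mem-like value: three numbers in strictly rising order. */
static int sketch_tcp_mem_ok(const char *value)
{
	unsigned long val[3];

	if (sketch_parse_ulongs(value, val, 3) != 3)
		return 0;
	return val[0] < val[1] && val[1] < val[2];
}
```

Each call consumes one number and advances past it, so the whole vector is handled with a single loop and no intermediate copies.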