Commits · 8259fdeb3032621c7cc7aa3f2676ffd470303305 · Kirill Smelkov / linux

28 Jan, 2021 2 commits

selftests/bpf: Verify that rebinding to port < 1024 from BPF works · 8259fdeb

Stanislav Fomichev authored Jan 27, 2021

Return 3 to indicate that permission check for port 111
should be skipped.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210127193140.3170382-2-sdf@google.com

8259fdeb

bpf: Allow rewriting to ports under ip_unprivileged_port_start · 77241217

Stanislav Fomichev authored Jan 27, 2021

At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port
to the privileged ones (< ip_unprivileged_port_start), but it will
be rejected later on in the __inet_bind or __inet6_bind.

Let's add another return value to indicate that CAP_NET_BIND_SERVICE
check should be ignored. Use the same idea as we currently use
in cgroup/egress where bit #1 indicates CN. Instead, for
cgroup/bind{4,6}, bit #1 indicates that CAP_NET_BIND_SERVICE should
be bypassed.

v5:
- rename flags to be less confusing (Andrey Ignatov)
- rework BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY to work on flags
  and accept BPF_RET_SET_CN (no behavioral changes)

v4:
- Add missing IPv6 support (Martin KaFai Lau)

v3:
- Update description (Martin KaFai Lau)
- Fix capability restore in selftest (Martin KaFai Lau)

v2:
- Switch to explicit return code (Martin KaFai Lau)
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20210127193140.3170382-1-sdf@google.com

77241217

27 Jan, 2021 2 commits

skmsg: Make sk_psock_destroy() static · 8063e184

Cong Wang authored Jan 27, 2021

sk_psock_destroy() is a RCU callback, I can't see any reason why
it could be used outside.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210127221501.46866-1-xiyou.wangcong@gmail.com

8063e184

bpf: Change 'BPF_ADD' to 'BPF_AND' in print_bpf_insn() · 60e578e8

Menglong Dong authored Jan 26, 2021

This 'BPF_ADD' is duplicated, and I belive it should be 'BPF_AND'.

Fixes: 981f94c3 ("bpf: Add bitwise atomic instructions")
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Brendan Jackman <jackmanb@google.com>
Link: https://lore.kernel.org/bpf/20210127022507.23674-1-dong.menglong@zte.com.cn

60e578e8

26 Jan, 2021 1 commit

selftests/bpf: Don't exit on failed bpf_testmod unload · 86ce322d

Andrii Nakryiko authored Jan 25, 2021

Fix bug in handling bpf_testmod unloading that will cause test_progs exiting
prematurely if bpf_testmod unloading failed. This is especially problematic
when running a subset of test_progs that doesn't require root permissions and
doesn't rely on bpf_testmod, yet will fail immediately due to exit(1) in
unload_bpf_testmod().

Fixes: 9f7fa225 ("selftests/bpf: Add bpf_testmod kernel module for testing")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210126065019.1268027-1-andrii@kernel.org

86ce322d

25 Jan, 2021 17 commits

samples/bpf: Set flag __SANE_USERSPACE_TYPES__ for MIPS to fix build warnings · 190d1c92

Tiezhu Yang authored Jan 25, 2021

There exists many build warnings when make M=samples/bpf on the Loongson
platform, this issue is MIPS related, x86 compiles just fine.

Here are some warnings:

  CC  samples/bpf/ibumad_user.o
samples/bpf/ibumad_user.c: In function ‘dump_counts’:
samples/bpf/ibumad_user.c:46:24: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘__u64’ {aka ‘long unsigned int’} [-Wformat=]
    printf("0x%02x : %llu\n", key, value);
                     ~~~^          ~~~~~
                     %lu
  CC  samples/bpf/offwaketime_user.o
samples/bpf/offwaketime_user.c: In function ‘print_ksym’:
samples/bpf/offwaketime_user.c:34:17: warning: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘__u64’ {aka ‘long unsigned int’} [-Wformat=]
   printf("%s/%llx;", sym->name, addr);
              ~~~^               ~~~~
              %lx
samples/bpf/offwaketime_user.c: In function ‘print_stack’:
samples/bpf/offwaketime_user.c:68:17: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘__u64’ {aka ‘long unsigned int’} [-Wformat=]
  printf(";%s %lld\n", key->waker, count);
              ~~~^                 ~~~~~
              %ld

MIPS needs __SANE_USERSPACE_TYPES__ before <linux/types.h> to select
'int-ll64.h' in arch/mips/include/uapi/asm/types.h, then it can avoid
build warnings when printing __u64 with %llu, %llx or %lld.

The header tools/include/linux/types.h defines __SANE_USERSPACE_TYPES__,
it seems that we can include <linux/types.h> in the source files which
have build warnings, but it has no effect due to actually it includes
usr/include/linux/types.h instead of tools/include/linux/types.h, the
problem is that "usr/include" is preferred first than "tools/include"
in samples/bpf/Makefile, that sounds like a ugly hack to -Itools/include
before -Iusr/include.

So define __SANE_USERSPACE_TYPES__ for MIPS in samples/bpf/Makefile
is proper, if add "TPROGS_CFLAGS += -D__SANE_USERSPACE_TYPES__" in
samples/bpf/Makefile, it appears the following error:

Auto-detecting system features:
...                        libelf: [ on  ]
...                          zlib: [ on  ]
...                           bpf: [ OFF ]

BPF API too old
make[3]: *** [Makefile:293: bpfdep] Error 1
make[2]: *** [Makefile:156: all] Error 2

With #ifndef __SANE_USERSPACE_TYPES__  in tools/include/linux/types.h,
the above error has gone and this ifndef change does not hurt other
compilations.
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/1611551146-14052-1-git-send-email-yangtiezhu@loongson.cn

190d1c92

tools, headers: Sync struct bpf_perf_event_data · 726bf76f

Florian Lehner authored Jan 23, 2021

Update struct bpf_perf_event_data with the addr field to match the
tools headers with the kernel headers.
Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210123185221.23946-1-dev@der-flo.net

726bf76f

selftests/bpf: Avoid useless void *-casts · 095af986

Björn Töpel authored Jan 22, 2021

There is no need to cast to void * when the argument is void *. Avoid
cluttering of code.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210122154725.22140-13-bjorn.topel@gmail.com

095af986

selftests/bpf: Consistent malloc/calloc usage · d08a17d6

Björn Töpel authored Jan 22, 2021

Use calloc instead of malloc where it makes sense, and avoid C++-style
void *-cast.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210122154725.22140-12-bjorn.topel@gmail.com

d08a17d6

selftests/bpf: Avoid heap allocation · 93dd4a06

Björn Töpel authored Jan 22, 2021

The data variable is only used locally. Instead of using the heap,
stick to using the stack.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210122154725.22140-11-bjorn.topel@gmail.com

93dd4a06

selftests/bpf: Define local variables at the beginning of a block · 829725ec

Björn Töpel authored Jan 22, 2021

Use C89 rules for variable definition.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210122154725.22140-10-bjorn.topel@gmail.com

829725ec

selftests/bpf: Change type from void * to struct generic_data * · 59a4a87e

Björn Töpel authored Jan 22, 2021

Instead of casting from void *, let us use the actual type in
gen_udp_hdr().
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210122154725.22140-9-bjorn.topel@gmail.com

59a4a87e

selftests/bpf: Change type from void * to struct ifaceconfigobj * · 124000e4

Björn Töpel authored Jan 22, 2021

Instead of casting from void *, let us use the actual type in
init_iface_config().
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210122154725.22140-8-bjorn.topel@gmail.com

124000e4

selftests/bpf: Remove casting by introduce local variable · 0b50bd48

Björn Töpel authored Jan 22, 2021

Let us use a local variable in nsswitchthread(), so we can remove a
lot of casting for better readability.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210122154725.22140-7-bjorn.topel@gmail.com

0b50bd48

selftests/bpf: Improve readability of xdpxceiver/worker_pkt_validate() · 8a9cba7e

Björn Töpel authored Jan 22, 2021

Introduce a local variable to get rid of lot of casting. Move common
code outside the if/else-clause.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210122154725.22140-6-bjorn.topel@gmail.com

8a9cba7e

selftests/bpf: Remove memory leak · 4896d7e3

Björn Töpel authored Jan 22, 2021

The allocated entry is immediately overwritten by an assignment. Fix
that.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210122154725.22140-5-bjorn.topel@gmail.com

4896d7e3

selftests/bpf: Fix style warnings · a8607283

Björn Töpel authored Jan 22, 2021

Silence three checkpatch style warnings.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210122154725.22140-4-bjorn.topel@gmail.com

a8607283

selftests/bpf: Remove unused enums · 449f0874

Björn Töpel authored Jan 22, 2021

The enums undef and bidi are not used. Remove them.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210122154725.22140-3-bjorn.topel@gmail.com

449f0874

selftests/bpf: Remove a lot of ifobject casting · 7140ef14

Björn Töpel authored Jan 22, 2021

Instead of passing void * all over the place, let us pass the actual
type (ifobject) and remove the void-ptr-to-type-ptr casting.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210122154725.22140-2-bjorn.topel@gmail.com

7140ef14

libbpf, xsk: Select AF_XDP BPF program based on kernel version · 78ed4045

Björn Töpel authored Jan 22, 2021

Add detection for kernel version, and adapt the BPF program based on
kernel support. This way, users will get the best possible performance
from the BPF program.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Marek Majtyka  <alardam@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20210122105351.11751-4-bjorn.topel@gmail.com

78ed4045

xsk: Fold xp_assign_dev and __xp_assign_dev · f0863eab

Björn Töpel authored Jan 22, 2021

Fold xp_assign_dev and __xp_assign_dev. The former directly calls the
latter.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20210122105351.11751-3-bjorn.topel@gmail.com

f0863eab

xsk: Remove explicit_free parameter from __xsk_rcv() · 458f7272

Björn Töpel authored Jan 22, 2021

The explicit_free parameter of the __xsk_rcv() function was used to
mark whether the call was via the generic XDP or the native XDP
path. Instead of clutter the code with if-statements and "true/false"
parameters which are hard to understand, simply move the explicit free
to the __xsk_map_redirect() which is always called from the native XDP
path.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20210122105351.11751-2-bjorn.topel@gmail.com

458f7272

22 Jan, 2021 3 commits

samples/bpf: Add xdp program on egress for xdp_redirect_map · 6e66fbb1

Hangbin Liu authored Jan 22, 2021

This patch add a xdp program on egress to show that we can modify
the packet on egress. In this sample we will set the pkt's src
mac to egress's mac address. The xdp_prog will be attached when
-X option supplied.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20210122025007.2968381-1-liuhangbin@gmail.com

6e66fbb1

bpf: Fix typo in scalar{,32}_min_max_rsh comments · 18b24d78

Tobias Klauser authored Jan 21, 2021

s/bounts/bounds/
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210121174324.24127-1-tklauser@distanz.ch

18b24d78

bpf, docs: Update build procedure for manually compiling LLVM and Clang · 628add78

Tiezhu Yang authored Jan 22, 2021

The current LLVM and Clang build procedure in samples/bpf/README.rst is
out of date. See below that the links are not accessible any more.

  $ git clone http://llvm.org/git/llvm.git
  Cloning into 'llvm'...
  fatal: unable to access 'http://llvm.org/git/llvm.git/': Maximum (20) redirects followed
  $ git clone --depth 1 http://llvm.org/git/clang.git
  Cloning into 'clang'...
  fatal: unable to access 'http://llvm.org/git/clang.git/': Maximum (20) redirects followed

The LLVM community has adopted new ways to build the compiler. There are
different ways to build LLVM and Clang, the Clang Getting Started page [1]
has one way. As Yonghong said, it is better to copy the build procedure
in Documentation/bpf/bpf_devel_QA.rst to keep consistent.

I verified the procedure and it is proved to be feasible, so we should
update README.rst to reflect the reality. At the same time, update the
related comment in Makefile.

Additionally, as Fangrui said, the dir llvm-project/llvm/build/install is
not used, BUILD_SHARED_LIBS=OFF is the default option [2], so also change
Documentation/bpf/bpf_devel_QA.rst together.

At last, we recommend that developers who want the fastest incremental
builds use the Ninja build system [1], you can find it in your system's
package manager, usually the package is ninja or ninja-build [3], so add
ninja to build dependencies suggested by Nathan.

  [1] https://clang.llvm.org/get_started.html
  [2] https://www.llvm.org/docs/CMake.html
  [3] https://github.com/ninja-build/ninja/wiki/Pre-built-Ninja-packagesSigned-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Cc: Fangrui Song <maskray@google.com>
Link: https://lore.kernel.org/bpf/1611279584-26047-1-git-send-email-yangtiezhu@loongson.cn

628add78

21 Jan, 2021 4 commits

selftest/bpf: Fix typo · 443edcef

Junlin Yang authored Jan 21, 2021

Change 'exeeds' to 'exceeds'.
Signed-off-by: Junlin Yang <yangjunlin@yulong.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210121122309.1501-1-angkery@163.com

443edcef

libbpf: Use string table index from index table if needed · 6095d5a2

Jiri Olsa authored Jan 21, 2021

For very large ELF objects (with many sections), we could
get special value SHN_XINDEX (65535) for elf object's string
table index - e_shstrndx.

Call elf_getshdrstrndx to get the proper string table index,
instead of reading it directly from ELF header.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210121202203.9346-4-jolsa@kernel.org

6095d5a2

docs: bpf: Clarify -mcpu=v3 requirement for atomic ops · b452ee00

Brendan Jackman authored Jan 20, 2021

Alexei pointed out [1] that this wording is pretty confusing. Here's
an attempt to be more explicit and clear.

[1] https://lore.kernel.org/bpf/CAADnVQJVvwoZsE1K+6qRxzF7+6CvZNzygnoBW9tZNWJELk5c=Q@mail.gmail.com/Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210120133946.2107897-3-jackmanb@google.com

b452ee00

docs: bpf: Fixup atomics markup · 53fe5418

Brendan Jackman authored Jan 20, 2021

This fixes up the markup to fix a warning, be more consistent with
use of monospace, and use the correct .rst syntax for <em> (* instead
of _).
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/bpf/20210120133946.2107897-2-jackmanb@google.com

53fe5418

20 Jan, 2021 11 commits

Merge branch 'bpf: misc performance improvements for cgroup' · 636d549f

Alexei Starovoitov authored Jan 20, 2021

Stanislav Fomichev says:

====================

First patch adds custom getsockopt for TCP_ZEROCOPY_RECEIVE
to remove kmalloc and lock_sock overhead from the dat path.

Second patch removes kzalloc/kfree from getsockopt for the common cases.

Third patch switches cgroup_bpf_enabled to be per-attach to
to add only overhead for the cgroup attach types used on the system.

No visible user-side changes.

v9:
- include linux/tcp.h instead of netinet/tcp.h in sockopt_sk.c
- note that v9 depends on the commit 4be34f3d ("bpf: Don't leak
  memory in bpf getsockopt when optlen == 0") from bpf tree

v8:
- add bpi.h to tools/include/uapi in the same patch (Martin KaFai Lau)
- kmalloc instead of kzalloc when exporting buffer (Martin KaFai Lau)
- note that v8 depends on the commit 4be34f3d ("bpf: Don't leak
  memory in bpf getsockopt when optlen == 0") from bpf tree

v7:
- add comment about buffer contents for retval != 0 (Martin KaFai Lau)
- export tcp.h into tools/include/uapi (Martin KaFai Lau)
- note that v7 depends on the commit 4be34f3d ("bpf: Don't leak
  memory in bpf getsockopt when optlen == 0") from bpf tree

v6:
- avoid indirect cost for new bpf_bypass_getsockopt (Eric Dumazet)

v5:
- reorder patches to reduce the churn (Martin KaFai Lau)

v4:
- update performance numbers
- bypass_bpf_getsockopt (Martin KaFai Lau)

v3:
- remove extra newline, add comment about sizeof tcp_zerocopy_receive
  (Martin KaFai Lau)
- add another patch to remove lock_sock overhead from
  TCP_ZEROCOPY_RECEIVE; technically, this makes patch #1 obsolete,
  but I'd still prefer to keep it to help with other socket
  options

v2:
- perf numbers for getsockopt kmalloc reduction (Song Liu)
- (sk) in BPF_CGROUP_PRE_CONNECT_ENABLED (Song Liu)
- 128 -> 64 buffer size, BUILD_BUG_ON (Martin KaFai Lau)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

636d549f

bpf: Split cgroup_bpf_enabled per attach type · a9ed15da

Stanislav Fomichev authored Jan 15, 2021

When we attach any cgroup hook, the rest (even if unused/unattached) start
to contribute small overhead. In particular, the one we want to avoid is
__cgroup_bpf_run_filter_skb which does two redirections to get to
the cgroup and pushes/pulls skb.

Let's split cgroup_bpf_enabled to be per-attach to make sure
only used attach types trigger.

I've dropped some existing high-level cgroup_bpf_enabled in some
places because BPF_PROG_CGROUP_XXX_RUN macros usually have another
cgroup_bpf_enabled check.

I also had to copy-paste BPF_CGROUP_RUN_SA_PROG_LOCK for
GETPEERNAME/GETSOCKNAME because type for cgroup_bpf_enabled[type]
has to be constant and known at compile time.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-4-sdf@google.com

a9ed15da

bpf: Try to avoid kzalloc in cgroup/{s,g}etsockopt · 20f2505f

Stanislav Fomichev authored Jan 15, 2021

When we attach a bpf program to cgroup/getsockopt any other getsockopt()
syscall starts incurring kzalloc/kfree cost.

Let add a small buffer on the stack and use it for small (majority)
{s,g}etsockopt values. The buffer is small enough to fit into
the cache line and cover the majority of simple options (most
of them are 4 byte ints).

It seems natural to do the same for setsockopt, but it's a bit more
involved when the BPF program modifies the data (where we have to
kmalloc). The assumption is that for the majority of setsockopt
calls (which are doing pure BPF options or apply policy) this
will bring some benefit as well.

Without this patch (we remove about 1% __kmalloc):
     3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
            |
             --3.30%--__cgroup_bpf_run_filter_getsockopt
                       |
                        --0.81%--__kmalloc
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-3-sdf@google.com

20f2505f

bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE · 9cacf81f

Stanislav Fomichev authored Jan 15, 2021

Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
call in do_tcp_getsockopt using the on-stack data. This removes
3% overhead for locking/unlocking the socket.

Without this patch:
     3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
            |
             --3.30%--__cgroup_bpf_run_filter_getsockopt
                       |
                        --0.81%--__kmalloc

With the patch applied:
     0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern

Note, exporting uapi/tcp.h requires removing netinet/tcp.h
from test_progs.h because those headers have confliciting
definitions.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com

9cacf81f

bpf: Permit size-0 datasec · 13ca51d5

Yonghong Song authored Jan 19, 2021

llvm patch https://reviews.llvm.org/D84002 permitted
to emit empty rodata datasec if the elf .rodata section
contains read-only data from local variables. These
local variables will be not emitted as BTF_KIND_VARs
since llvm converted these local variables as
static variables with private linkage without debuginfo
types. Such an empty rodata datasec will make
skeleton code generation easy since for skeleton
a rodata struct will be generated if there is a
.rodata elf section. The existence of a rodata
btf datasec is also consistent with the existence
of a rodata map created by libbpf.

The btf with such an empty rodata datasec will fail
in the kernel though as kernel will reject a datasec
with zero vlen and zero size. For example, for the below code,
    int sys_enter(void *ctx)
    {
       int fmt[6] = {1, 2, 3, 4, 5, 6};
       int dst[6];

       bpf_probe_read(dst, sizeof(dst), fmt);
       return 0;
    }
We got the below btf (bpftool btf dump ./test.o):
    [1] PTR '(anon)' type_id=0
    [2] FUNC_PROTO '(anon)' ret_type_id=3 vlen=1
            'ctx' type_id=1
    [3] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
    [4] FUNC 'sys_enter' type_id=2 linkage=global
    [5] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED
    [6] ARRAY '(anon)' type_id=5 index_type_id=7 nr_elems=4
    [7] INT '__ARRAY_SIZE_TYPE__' size=4 bits_offset=0 nr_bits=32 encoding=(none)
    [8] VAR '_license' type_id=6, linkage=global-alloc
    [9] DATASEC '.rodata' size=0 vlen=0
    [10] DATASEC 'license' size=0 vlen=1
            type_id=8 offset=0 size=4
When loading the ./test.o to the kernel with bpftool,
we see the following error:
    libbpf: Error loading BTF: Invalid argument(22)
    libbpf: magic: 0xeb9f
    ...
    [6] ARRAY (anon) type_id=5 index_type_id=7 nr_elems=4
    [7] INT __ARRAY_SIZE_TYPE__ size=4 bits_offset=0 nr_bits=32 encoding=(none)
    [8] VAR _license type_id=6 linkage=1
    [9] DATASEC .rodata size=24 vlen=0 vlen == 0
    libbpf: Error loading .BTF into kernel: -22. BTF is optional, ignoring.

Basically, libbpf changed .rodata datasec size to 24 since elf .rodata
section size is 24. The kernel then rejected the BTF since vlen = 0.
Note that the above kernel verifier failure can be worked around with
changing local variable "fmt" to a static or global, optionally const, variable.

This patch permits a datasec with vlen = 0 in kernel.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210119153519.3901963-1-yhs@fb.com

13ca51d5

Merge branch 'Allow attaching to bare tracepoints' · 71ee10e2

Alexei Starovoitov authored Jan 19, 2021

Qais Yousef says:

====================

Changes in v3:
	* Fix not returning error value correctly in
	  trigger_module_test_write() (Yonghong)
	* Add Yonghong acked-by to patch 1.

Changes in v2:
	* Fix compilation error. (Andrii)
	* Make the new test use write() instead of read() (Andrii)

Add some missing glue logic to teach bpf about bare tracepoints - tracepoints
without any trace event associated with them.

Bare tracepoints are declare with DECLARE_TRACE(). Full tracepoints are declare
with TRACE_EVENT().

BPF can attach to these tracepoints as RAW_TRACEPOINT() only as there're no
events in tracefs created with them.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

71ee10e2

selftests: bpf: Add a new test for bare tracepoints · 407be922

Qais Yousef authored Jan 19, 2021

Reuse module_attach infrastructure to add a new bare tracepoint to check
we can attach to it as a raw tracepoint.
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210119122237.2426878-3-qais.yousef@arm.com

407be922

Merge branch 'bpf,x64: implement jump padding in jit' · 86e6b4e9

Alexei Starovoitov authored Jan 19, 2021

Gary Lin says:

====================
This patch series implements jump padding to x64 jit to cover some
corner cases that used to consume more than 20 jit passes and caused
failure.

v4:
  - Add the detailed comments about the possible padding bytes
  - Add the second test case which triggers jmp_cond padding and imm32 nop
    jmp padding.
  - Add the new test case as another subprog

v3:
  - Copy the instructions of prologue separately or the size calculation
    of the first BPF instruction would include the prologue.
  - Replace WARN_ONCE() with pr_err() and EFAULT
  - Use MAX_PASSES in the for loop condition check
  - Remove the "padded" flag from x64_jit_data. For the extra pass of
    subprogs, padding is always enabled since it won't hurt the images
    that converge without padding.
v2:
  - Simplify the sample code in the commit description and provide the
    jit code
  - Check the expected padding bytes with WARN_ONCE
  - Move the 'padded' flag to 'struct x64_jit_data'
  - Remove the EXPECTED_FAIL flag from bpf_fill_maxinsns11() in test_bpf
  - Add 2 verifier tests
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

86e6b4e9

trace: bpf: Allow bpf to attach to bare tracepoints · 6939f4ef

Qais Yousef authored Jan 19, 2021

Some subsystems only have bare tracepoints (a tracepoint with no
associated trace event) to avoid the problem of trace events being an
ABI that can't be changed.

>From bpf presepective, bare tracepoints are what it calls
RAW_TRACEPOINT().

Since bpf assumed there's 1:1 mapping, it relied on hooking to
DEFINE_EVENT() macro to create bpf mapping of the tracepoints. Since
bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had
no knowledge about their existence.

By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to
DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints.

Enabling that comes with the contract that changes to raw tracepoints
don't constitute a regression if they break existing bpf programs.
We need the ability to continue to morph and modify these raw
tracepoints without worrying about any ABI.

Update Documentation/bpf/bpf_design_QA.rst to document this contract.
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210119122237.2426878-2-qais.yousef@arm.com

6939f4ef

selftests/bpf: Add verifier tests for x64 jit jump padding · 79d1b684

Gary Lin authored Jan 19, 2021

There are 3 tests added into verifier's jit tests to trigger x64
jit jump padding.

The first test can be represented as the following assembly code:

      1: bpf_call bpf_get_prandom_u32
      2: if r0 == 1 goto pc+128
      3: if r0 == 2 goto pc+128
         ...
    129: if r0 == 128 goto pc+128
    130: goto pc+128
    131: goto pc+127
         ...
    256: goto pc+2
    257: goto pc+1
    258: r0 = 1
    259: ret

We first store a random number to r0 and add the corresponding
conditional jumps (2~129) to make verifier believe that those jump
instructions from 130 to 257 are reachable. When the program is sent to
x64 jit, it starts to optimize out the NOP jumps backwards from 257.
Since there are 128 such jumps, the program easily reaches 15 passes and
triggers jump padding.

Here is the x64 jit code of the first test:

      0:    0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
      5:    66 90                   xchg   ax,ax
      7:    55                      push   rbp
      8:    48 89 e5                mov    rbp,rsp
      b:    e8 4c 90 75 e3          call   0xffffffffe375905c
     10:    48 83 f8 01             cmp    rax,0x1
     14:    0f 84 fe 04 00 00       je     0x518
     1a:    48 83 f8 02             cmp    rax,0x2
     1e:    0f 84 f9 04 00 00       je     0x51d
      ...
     f6:    48 83 f8 18             cmp    rax,0x18
     fa:    0f 84 8b 04 00 00       je     0x58b
    100:    48 83 f8 19             cmp    rax,0x19
    104:    0f 84 86 04 00 00       je     0x590
    10a:    48 83 f8 1a             cmp    rax,0x1a
    10e:    0f 84 81 04 00 00       je     0x595
      ...
    500:    0f 84 83 01 00 00       je     0x689
    506:    48 81 f8 80 00 00 00    cmp    rax,0x80
    50d:    0f 84 76 01 00 00       je     0x689
    513:    e9 71 01 00 00          jmp    0x689
    518:    e9 6c 01 00 00          jmp    0x689
      ...
    5fe:    e9 86 00 00 00          jmp    0x689
    603:    e9 81 00 00 00          jmp    0x689
    608:    0f 1f 00                nop    DWORD PTR [rax]
    60b:    eb 7c                   jmp    0x689
    60d:    eb 7a                   jmp    0x689
      ...
    683:    eb 04                   jmp    0x689
    685:    eb 02                   jmp    0x689
    687:    66 90                   xchg   ax,ax
    689:    b8 01 00 00 00          mov    eax,0x1
    68e:    c9                      leave
    68f:    c3                      ret

As expected, a 3 bytes NOPs is inserted at 608 due to the transition
from imm32 jmp to imm8 jmp. A 2 bytes NOPs is also inserted at 687 to
replace a NOP jump.

The second test case is tricky. Here is the assembly code:

       1: bpf_call bpf_get_prandom_u32
       2: if r0 == 1 goto pc+2048
       3: if r0 == 2 goto pc+2048
       ...
    2049: if r0 == 2048 goto pc+2048
    2050: goto pc+2048
    2051: goto pc+16
    2052: goto pc+15
       ...
    2064: goto pc+3
    2065: goto pc+2
    2066: goto pc+1
       ...
       [repeat "goto pc+16".."goto pc+1" 127 times]
       ...
    4099: r0 = 2
    4100: ret

There are 4 major parts of the program.
1) 1~2049: Those are instructions to make 2050~4098 reachable. Some of
           them also could generate the padding for jmp_cond.
2) 2050: This is the target instruction for the imm32 nop jmp padding.
3) 2051~4098: The repeated "goto 1~16" instructions are designed to be
              consumed by the nop jmp optimization. In the end, those
              instrucitons become 128 continuous 0 offset jmp and are
              optimized out in 1 pass, and this make insn 2050 an imm32
              nop jmp in the next pass, so that we can trigger the
              5 bytes padding.
4) 4099~4100: Those are the instructions to end the program.

The x64 jit code is like this:

       0:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
       5:       66 90                   xchg   ax,ax
       7:       55                      push   rbp
       8:       48 89 e5                mov    rbp,rsp
       b:       e8 bc 7b d5 d3          call   0xffffffffd3d57bcc
      10:       48 83 f8 01             cmp    rax,0x1
      14:       0f 84 7e 66 00 00       je     0x6698
      1a:       48 83 f8 02             cmp    rax,0x2
      1e:       0f 84 74 66 00 00       je     0x6698
      24:       48 83 f8 03             cmp    rax,0x3
      28:       0f 84 6a 66 00 00       je     0x6698
      2e:       48 83 f8 04             cmp    rax,0x4
      32:       0f 84 60 66 00 00       je     0x6698
      38:       48 83 f8 05             cmp    rax,0x5
      3c:       0f 84 56 66 00 00       je     0x6698
      42:       48 83 f8 06             cmp    rax,0x6
      46:       0f 84 4c 66 00 00       je     0x6698
      ...
    666c:       48 81 f8 fe 07 00 00    cmp    rax,0x7fe
    6673:       0f 1f 40 00             nop    DWORD PTR [rax+0x0]
    6677:       74 1f                   je     0x6698
    6679:       48 81 f8 ff 07 00 00    cmp    rax,0x7ff
    6680:       0f 1f 40 00             nop    DWORD PTR [rax+0x0]
    6684:       74 12                   je     0x6698
    6686:       48 81 f8 00 08 00 00    cmp    rax,0x800
    668d:       0f 1f 40 00             nop    DWORD PTR [rax+0x0]
    6691:       74 05                   je     0x6698
    6693:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
    6698:       b8 02 00 00 00          mov    eax,0x2
    669d:       c9                      leave
    669e:       c3                      ret

Since insn 2051~4098 are optimized out right before the padding pass,
there are several conditional jumps from the first part are replaced with
imm8 jmp_cond, and this triggers the 4 bytes padding, for example at
6673, 6680, and 668d. On the other hand, Insn 2050 is replaced with the
5 bytes nops at 6693.

The third test is to invoke the first and second tests as subprogs to test
bpf2bpf. Per the system log, there was one more jit happened with only
one pass and the same jit code was produced.

v4:
  - Add the second test case which triggers jmp_cond padding and imm32 nop
    jmp padding.
  - Add the new test case as another subprog
Signed-off-by: Gary Lin <glin@suse.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210119102501.511-4-glin@suse.com

79d1b684

test_bpf: Remove EXPECTED_FAIL flag from bpf_fill_maxinsns11 · 16a660ef

Gary Lin authored Jan 19, 2021

With NOPs padding, x64 jit now can handle the jump cases like
bpf_fill_maxinsns11().
Signed-off-by: Gary Lin <glin@suse.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210119102501.511-3-glin@suse.com

16a660ef