Commits · 50450fc716c1a570ee8d8bfe198ef5d3cfca36e4 · Kirill Smelkov / linux

30 Jul, 2020 9 commits

libbpf: Make destructors more robust by handling ERR_PTR(err) cases · 50450fc7

Andrii Nakryiko authored Jul 29, 2020

Most of libbpf "constructors" on failure return ERR_PTR(err) result encoded as
a pointer. It's a common mistake to eventually pass such malformed pointers
into xxx__destroy()/xxx__free() "destructors". So instead of fixing up
clean up code in selftests and user programs, handle such error pointers in
destructors themselves. This works beautifully for NULL pointers passed to
destructors, so might as well just work for error pointers.
Suggested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200729232148.896125-1-andriin@fb.com

50450fc7

selftests/bpf: Omit nodad flag when adding addresses to loopback · a6599abd

Jakub Sitnicki authored Jul 30, 2020

Setting IFA_F_NODAD flag for IPv6 addresses to add to loopback is
unnecessary. Duplicate Address Detection does not happen on loopback
device.

Also, passing 'nodad' flag to 'ip address' breaks libbpf CI, which runs in
an environment with BusyBox implementation of 'ip' command, that doesn't
understand this flag.

Fixes: 0ab5539f ("selftests/bpf: Tests for BPF_SK_LOOKUP attach point")
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Andrii Nakryiko <andrii@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200730125325.1869363-1-jakub@cloudflare.com

a6599abd

selftests/bpf: Don't destroy failed link · 80546ac4

Andrii Nakryiko authored Jul 28, 2020

Check that link is NULL or proper pointer before invoking bpf_link__destroy().
Not doing this causes crash in test_progs, when cg_storage_multi selftest
fails.

Fixes: 3573f384 ("selftests/bpf: Test CGROUP_STORAGE behavior on shared egress + ingress")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200729045056.3363921-1-andriin@fb.com

80546ac4

selftests/bpf: Add xdpdrv mode for test_xdp_redirect · dfdb0d93

Hangbin Liu authored Jul 29, 2020

This patch add xdpdrv mode for test_xdp_redirect.sh since veth has support
native mode. After update here is the test result:

  # ./test_xdp_redirect.sh
  selftests: test_xdp_redirect xdpgeneric [PASS]
  selftests: test_xdp_redirect xdpdrv [PASS]
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: William Tu <u9012063@gmail.com>
Link: https://lore.kernel.org/bpf/20200729085658.403794-1-liuhangbin@gmail.com

dfdb0d93

selftests/bpf: Verify socket storage in cgroup/sock_{create, release} · 4fb5f949

Stanislav Fomichev authored Jul 28, 2020

Augment udp_limit test to set and verify socket storage value.
That should be enough to exercise the changes from the previous
patch.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200729003104.1280813-2-sdf@google.com

4fb5f949

bpf: Expose socket storage to BPF_PROG_TYPE_CGROUP_SOCK · f7c6cb1d

Stanislav Fomichev authored Jul 28, 2020

This lets us use socket storage from the following hooks:

* BPF_CGROUP_INET_SOCK_CREATE
* BPF_CGROUP_INET_SOCK_RELEASE
* BPF_CGROUP_INET4_POST_BIND
* BPF_CGROUP_INET6_POST_BIND

Using existing 'bpf_sk_storage_get_proto' doesn't work because
second argument is ARG_PTR_TO_SOCKET. Even though
BPF_PROG_TYPE_CGROUP_SOCK hooks operate on 'struct bpf_sock',
the verifier still considers it as a PTR_TO_CTX.
That's why I'm adding another 'bpf_sk_storage_get_cg_sock_proto'
definition strictly for BPF_PROG_TYPE_CGROUP_SOCK which accepts
ARG_PTR_TO_CTX which is really 'struct sock' for this program type.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200729003104.1280813-1-sdf@google.com

f7c6cb1d

selftests/bpf: Test bpf_iter buffer access with negative offset · 12e6196f

Yonghong Song authored Jul 28, 2020

Commit afbf21dc ("bpf: Support readonly/readwrite buffers
in verifier") added readonly/readwrite buffer support which
is currently used by bpf_iter tracing programs. It has
a bug with incorrect parameter ordering which later fixed
by Commit f6dfbe31 ("bpf: Fix swapped arguments in calls
to check_buffer_access").

This patch added a test case with a negative offset access
which will trigger the error path.

Without Commit f6dfbe31, running the test case in the patch,
the error message looks like:
   R1_w=rdwr_buf(id=0,off=0,imm=0) R10=fp0
  ; value_sum += *(__u32 *)(value - 4);
  2: (61) r1 = *(u32 *)(r1 -4)
  R1 invalid (null) buffer access: off=-4, size=4

With the above commit, the error message looks like:
   R1_w=rdwr_buf(id=0,off=0,imm=0) R10=fp0
  ; value_sum += *(__u32 *)(value - 4);
  2: (61) r1 = *(u32 *)(r1 -4)
  R1 invalid rdwr buffer access: off=-4, size=4
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200728221801.1090406-1-yhs@fb.com

12e6196f

bpf: Add missing newline characters in verifier error messages · 4fc00b79

Yonghong Song authored Jul 28, 2020

Newline characters are added in two verifier error messages,
refactored in Commit afbf21dc ("bpf: Support readonly/readwrite
buffers in verifier"). This way, they do not mix with
messages afterwards.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200728221801.1090349-1-yhs@fb.com

4fc00b79

bpf, arm64: Add BPF exception tables · 80083428

Jean-Philippe Brucker authored Jul 28, 2020

When a tracing BPF program attempts to read memory without using the
bpf_probe_read() helper, the verifier marks the load instruction with
the BPF_PROBE_MEM flag. Since the arm64 JIT does not currently recognize
this flag it falls back to the interpreter.

Add support for BPF_PROBE_MEM, by appending an exception table to the
BPF program. If the load instruction causes a data abort, the fixup
infrastructure finds the exception table and fixes up the fault, by
clearing the destination register and jumping over the faulting
instruction.

To keep the compact exception table entry format, inspect the pc in
fixup_exception(). A more generic solution would add a "handler" field
to the table entry, like on x86 and s390.
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200728152122.1292756-2-jean-philippe@linaro.org

80083428

28 Jul, 2020 6 commits

bpf: Fix build without CONFIG_NET when using BPF XDP link · 310ad797

Andrii Nakryiko authored Jul 28, 2020

Entire net/core subsystem is not built without CONFIG_NET. linux/netdevice.h
just assumes that it's always there, so the easiest way to fix this is to
conditionally compile out bpf_xdp_link_attach() use in bpf/syscall.c.

Fixes: aa8d3a71 ("bpf, xdp: Add bpf_link-based XDP attachment API")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200728190527.110830-1-andriin@fb.com

310ad797

bpf, selftests: use :: 1 for localhost in tcp_server.py · ca5cd355

John Fastabend authored Jul 28, 2020

Using localhost requires the host to have a /etc/hosts file with that
specific line in it. By default my dev box did not, they used
ip6-localhost, so the test was failing. To fix remove the need for any
/etc/hosts and use ::1.

I could just add the line, but this seems easier.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/159594714197.21431.10113693935099326445.stgit@john-Precision-5820-Tower

ca5cd355

xdp: Prevent kernel-infoleak in xsk_getsockopt() · 3c4f850e

Peilin Ye authored Jul 28, 2020

xsk_getsockopt() is copying uninitialized stack memory to userspace when
'extra_stats' is 'false'. Fix it. Doing '= {};' is sufficient since currently
'struct xdp_statistics' is defined as follows:

  struct xdp_statistics {
    __u64 rx_dropped;
    __u64 rx_invalid_descs;
    __u64 tx_invalid_descs;
    __u64 rx_ring_full;
    __u64 rx_fill_ring_empty_descs;
    __u64 tx_ring_empty_descs;
  };

When being copied to the userspace, 'stats' will not contain any uninitialized
'holes' between struct fields.

Fixes: 8aa5a335 ("xsk: Add new statistics")
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/bpf/20200728053604.404631-1-yepeilin.cs@gmail.com

3c4f850e

bpf: Fix swapped arguments in calls to check_buffer_access · f6dfbe31

Colin Ian King authored Jul 27, 2020

There are a couple of arguments of the boolean flag zero_size_allowed and
the char pointer buf_info when calling to function check_buffer_access that
are swapped by mistake. Fix these by swapping them to correct the argument
ordering.

Fixes: afbf21dc ("bpf: Support readonly/readwrite buffers in verifier")
Addresses-Coverity: ("Array compared to 0")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200727175411.155179-1-colin.king@canonical.com

f6dfbe31

selftests/bpf: Add new bpf_iter context structs to fix build on old kernels · 363885d7

Andrii Nakryiko authored Jul 27, 2020

Add bpf_iter__bpf_map_elem and bpf_iter__bpf_sk_storage_map to bpf_iter.h.

Fixes: 3b1c420b ("selftests/bpf: Add a test for bpf sk_storage_map iterator")
Fixes: 2a7c2fff ("selftests/bpf: Add test for bpf hash map iterators")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200727233345.1686358-1-andriin@fb.com

363885d7

bpf: Fix bpf_ringbuf_output() signature to return long · e1613b57

Andrii Nakryiko authored Jul 27, 2020

Due to bpf tree fix merge, bpf_ringbuf_output() signature ended up with int as
a return type, while all other helpers got converted to returning long. So fix
it in bpf-next now.

Fixes: b0659d8a ("bpf: Fix definition of bpf_ringbuf_output() helper in UAPI comments")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200727224715.652037-1-andriin@fb.com

e1613b57

27 Jul, 2020 2 commits

tools, bpftool: Add LSM type to array of prog names · 9a97c9d2

Quentin Monnet authored Jul 24, 2020

Assign "lsm" as a printed name for BPF_PROG_TYPE_LSM in bpftool, so that
it can use it when listing programs loaded on the system or when probing
features.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200724090618.16378-3-quentin@isovalent.com

9a97c9d2

tools, bpftool: Skip type probe if name is not found · 70cfab1d

Quentin Monnet authored Jul 24, 2020

For probing program and map types, bpftool loops on type values and uses
the relevant type name in prog_type_name[] or map_type_name[]. To ensure
the name exists, we exit from the loop if we go over the size of the
array.

However, this is not enough in the case where the arrays have "holes" in
them, program or map types for which they have no name, but not at the
end of the list. This is currently the case for BPF_PROG_TYPE_LSM, not
known to bpftool and which name is a null string. When probing for
features, bpftool attempts to strlen() that name and segfaults.

Let's fix it by skipping probes for "unknown" program and map types,
with an informational message giving the numeral value in that case.

Fixes: 93a3545d ("tools/bpftool: Add name mappings for SK_LOOKUP prog and attach type")
Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200724090618.16378-2-quentin@isovalent.com

70cfab1d

26 Jul, 2020 23 commits

Merge branch 'bpf_link-XDP' · 47960ad6

Alexei Starovoitov authored Jul 25, 2020

Andrii Nakryiko says:

====================
Following cgroup and netns examples, implement bpf_link support for XDP.

The semantics is described in patch #2. Program and link attachments are
mutually exclusive, in the sense that neither link can replace attached
program nor program can replace attached link. Link can't replace attached
link as well, as is the case for any other bpf_link implementation.

Patch #1 refactors existing BPF program-based attachment API and centralizes
high-level query/attach decisions in generic kernel code, while drivers are
kept simple and are instructed with low-level decisions about attaching and
detaching specific bpf_prog. This also makes QUERY command unnecessary, and
patch #8 removes support for it from all kernel drivers. If that's a bad idea,
we can drop that patch altogether.

With refactoring in patch #1, adding bpf_xdp_link is completely transparent to
drivers, they are still functioning at the level of "effective" bpf_prog, that
should be called in XDP data path.

Corresponding libbpf support for BPF XDP link is added in patch #5.

v3->v4:
- fix a compilation warning in one of drivers (Jakub);

v2->v3:
- fix build when CONFIG_BPF_SYSCALL=n (kernel test robot);

v1->v2:
- fix prog refcounting bug (David);
- split dev_change_xdp_fd() changes into 2 patches (David);
- add extack messages to all user-induced errors (David).
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

47960ad6

bpf, xdp: Remove XDP_QUERY_PROG and XDP_QUERY_PROG_HW XDP commands · e8407fde

Andrii Nakryiko authored Jul 21, 2020

Now that BPF program/link management is centralized in generic net_device
code, kernel code never queries program id from drivers, so
XDP_QUERY_PROG/XDP_QUERY_PROG_HW commands are unnecessary.

This patch removes all the implementations of those commands in kernel, along
the xdp_attachment_query().

This patch was compile-tested on allyesconfig.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-10-andriin@fb.com

e8407fde

selftests/bpf: Add BPF XDP link selftests · fe48230c

Andrii Nakryiko authored Jul 21, 2020

Add selftest validating all the attachment logic around BPF XDP link. Test
also link updates and get_obj_info() APIs.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-9-andriin@fb.com

fe48230c

libbpf: Add support for BPF XDP link · dc8698ca

Andrii Nakryiko authored Jul 21, 2020

Sync UAPI header and add support for using bpf_link-based XDP attachment.
Make xdp/ prog type set expected attach type. Kernel didn't enforce
attach_type for XDP programs before, so there is no backwards compatiblity
issues there.

Also fix section_names selftest to recognize that xdp prog types now have
expected attach type.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-8-andriin@fb.com

dc8698ca

bpf: Implement BPF XDP link-specific introspection APIs · c1931c97

Andrii Nakryiko authored Jul 21, 2020

Implement XDP link-specific show_fdinfo and link_info to emit ifindex.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-7-andriin@fb.com

c1931c97

bpf, xdp: Implement LINK_UPDATE for BPF XDP link · 026a4c28

Andrii Nakryiko authored Jul 21, 2020

Add support for LINK_UPDATE command for BPF XDP link to enable reliable
replacement of underlying BPF program.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-6-andriin@fb.com

026a4c28

bpf, xdp: Add bpf_link-based XDP attachment API · aa8d3a71

Andrii Nakryiko authored Jul 21, 2020

Add bpf_link-based API (bpf_xdp_link) to attach BPF XDP program through
BPF_LINK_CREATE command.

bpf_xdp_link is mutually exclusive with direct BPF program attachment,
previous BPF program should be detached prior to attempting to create a new
bpf_xdp_link attachment (for a given XDP mode). Once BPF link is attached, it
can't be replaced by other BPF program attachment or link attachment. It will
be detached only when the last BPF link FD is closed.

bpf_xdp_link will be auto-detached when net_device is shutdown, similarly to
how other BPF links behave (cgroup, flow_dissector). At that point bpf_link
will become defunct, but won't be destroyed until last FD is closed.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-5-andriin@fb.com

aa8d3a71

bpf, xdp: Extract common XDP program attachment logic · d4baa936

Andrii Nakryiko authored Jul 21, 2020

Further refactor XDP attachment code. dev_change_xdp_fd() is split into two
parts: getting bpf_progs from FDs and attachment logic, working with
bpf_progs. This makes attachment logic a bit more straightforward and
prepares code for bpf_xdp_link inclusion, which will share the common logic.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-4-andriin@fb.com

d4baa936

bpf, xdp: Maintain info on attached XDP BPF programs in net_device · 7f0a8382

Andrii Nakryiko authored Jul 21, 2020

Instead of delegating to drivers, maintain information about which BPF
programs are attached in which XDP modes (generic/skb, driver, or hardware)
locally in net_device. This effectively obsoletes XDP_QUERY_PROG command.

Such re-organization simplifies existing code already. But it also allows to
further add bpf_link-based XDP attachments without drivers having to know
about any of this at all, which seems like a good setup.
XDP_SETUP_PROG/XDP_SETUP_PROG_HW are just low-level commands to driver to
install/uninstall active BPF program. All the higher-level concerns about
prog/link interaction will be contained within generic driver-agnostic logic.

All the XDP_QUERY_PROG calls to driver in dev_xdp_uninstall() were removed.
It's not clear for me why dev_xdp_uninstall() were passing previous prog_flags
when resetting installed programs. That seems unnecessary, plus most drivers
don't populate prog_flags anyways. Having XDP_SETUP_PROG vs XDP_SETUP_PROG_HW
should be enough of an indicator of what is required of driver to correctly
reset active BPF program. dev_xdp_uninstall() is also generalized as an
iteration over all three supported mode.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-3-andriin@fb.com

7f0a8382

bpf: Make bpf_link API available indepently of CONFIG_BPF_SYSCALL · 6cc7d1e8

Andrii Nakryiko authored Jul 21, 2020

Similarly to bpf_prog, make bpf_link and related generic API available
unconditionally to make it easier to have bpf_link support in various parts of
the kernel. Stub out init/prime/settle/cleanup and inc/put APIs.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-2-andriin@fb.com

6cc7d1e8

bpf: Fix build on architectures with special bpf_user_pt_regs_t · 2b9b305f

Song Liu authored Jul 24, 2020

Architectures like s390, powerpc, arm64, riscv have speical definition of
bpf_user_pt_regs_t. So we need to cast the pointer before passing it to
bpf_get_stack(). This is similar to bpf_get_stack_tp().

Fixes: 03d42fd2d83f ("bpf: Separate bpf_get_[stack|stackid] for perf events BPF")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200724200503.3629591-1-songliubraving@fb.com

2b9b305f

bpf/local_storage: Fix build without CONFIG_CGROUP · dfcdf0e9

YiFei Zhu authored Jul 24, 2020

local_storage.o has its compile guard as CONFIG_BPF_SYSCALL, which
does not imply that CONFIG_CGROUP is on. Including cgroup-internal.h
when CONFIG_CGROUP is off cause a compilation failure.

Fixes: f67cfc233706 ("bpf: Make cgroup storages shared between programs on the same cgroup")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200724211753.902969-1-zhuyifei1999@gmail.com

dfcdf0e9

Merge branch 'shared-cgroup-storage' · 36f72484

Alexei Starovoitov authored Jul 23, 2020

YiFei Zhu says:

====================
To access the storage in a CGROUP_STORAGE map, one uses
bpf_get_local_storage helper, which is extremely fast due to its
use of per-CPU variables. However, its whole code is built on
the assumption that one map can only be used by one program at any
time, and this prohibits any sharing of data between multiple
programs using these maps, eliminating a lot of use cases, such
as some per-cgroup configuration storage, written to by a
setsockopt program and read by a cg_sock_addr program.

Why not use other map types? The great part of CGROUP_STORAGE map
is that it is isolated by different cgroups its attached to. When
one program uses bpf_get_local_storage, even on the same map, it
gets different storages if it were run as a result of attaching
to different cgroups. The kernel manages the storages, simplifying
BPF program or userspace. In theory, one could probably use other
maps like array or hash to do the same thing, but it would be a
major overhead / complexity. Userspace needs to know when a cgroup
is being freed in order to free up a space in the replacement map.

This patch set introduces a significant change to the semantics of
CGROUP_STORAGE map type. Instead of each storage being tied to one
single attachment, it is shared across different attachments to
the same cgroup, and persists until either the map or the cgroup
attached to is being freed.

User may use u64 as the key to the map, and the result would be
that the attach type become ignored during key comparison, and
programs of different attach types will share the same storage if
the cgroups they are attached to are the same.

How could this break existing users?
* Users that uses detach & reattach / program replacement as a
  shortcut to zeroing the storage. Since we need sharing between
  programs, we cannot zero the storage. Users that expect this
  behavior should either attach a program with a new map, or
  explicitly zero the map with a syscall.
This case is dependent on undocumented implementation details,
so the impact should be very minimal.

Patch 1 introduces a test on the old expected behavior of the map
type.

Patch 2 introduces a test showing how two programs cannot share
one such map.

Patch 3 implements the change of semantics to the map.

Patch 4 amends the new test such that it yields the behavior we
expect from the change.

Patch 5 documents the map type.

Changes since RFC:
* Clarify commit message in patch 3 such that it says the lifetime
  of the storage is ended at the freeing of the cgroup_bpf, rather
  than the cgroup itself.
* Restored an -ENOMEM check in __cgroup_bpf_attach.
* Update selftests for recent change in network_helpers API.

Changes since v1:
* s/CHECK_FAIL/CHECK/
* s/bpf_prog_attach/bpf_program__attach_cgroup/
* Moved test__start_subtest to test_cg_storage_multi.
* Removed some redundant CHECK_FAIL where they are already CHECK-ed.

Changes since v2:
* Lock cgroup_mutex during map_free.
* Publish new storages only if attach is successful, by tracking
  exactly which storages are reused in an array of bools.
* Mention bpftool map dump showing a value of zero for attach_type
  in patch 3 commit message.

Changes since v3:
* Use a much simpler lookup and allocate-if-not-exist from the fact
  that cgroup_mutex is locked during attach.
* Removed an unnecessary spinlock hold.

Changes since v4:
* Changed semantics so that if the key type is struct
  bpf_cgroup_storage_key the map retains isolation between different
  attach types. Sharing between different attach types only occur
  when key type is u64.
* Adapted tests and docs for the above change.

Changes since v5:
* Removed redundant NULL check before bpf_link__destroy.
* Free BPF object explicitly, after asserting that object failed to
  load, in the event that the object did not fail to load.
* Rename variable in bpf_cgroup_storage_key_cmp for clarity.
* Added a lot of information to Documentation, more or less copied
  from what Martin KaFai Lau wrote.
====================
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

36f72484

Documentation/bpf: Document CGROUP_STORAGE map type · 4e15f460

YiFei Zhu authored Jul 23, 2020

The machanics and usage are not very straightforward. Given the
changes it's better to document how it works and how to use it,
rather than having to rely on the examples and implementation to
infer what is going on.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/b412edfbb05cb1077c9e2a36a981a54ee23fa8b3.1595565795.git.zhuyifei@google.com

4e15f460

selftests/bpf: Test CGROUP_STORAGE behavior on shared egress + ingress · 3573f384

YiFei Zhu authored Jul 23, 2020

This mirrors the original egress-only test. The cgroup_storage is
now extended to have two packet counters, one for egress and one
for ingress. We also extend to have two egress programs to test
that egress will always share with other egress origrams in the
same cgroup. The behavior of the counters are exactly the same as
the original egress-only test.

The test is split into two, one "isolated" test that when the key
type is struct bpf_cgroup_storage_key, which contains the attach
type, programs of different attach types will see different
storages. The other, "shared" test that when the key type is u64,
programs of different attach types will see the same storage if
they are attached to the same cgroup.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/c756f5f1521227b8e6e90a453299dda722d7324d.1595565795.git.zhuyifei@google.com

3573f384

Merge branch 'fix-bpf_get_stack-with-PEBS' · 90065c06

Alexei Starovoitov authored Jul 23, 2020

Song Liu says:

====================
Calling get_perf_callchain() on perf_events from PEBS entries may cause
unwinder errors. To fix this issue, perf subsystem fetches callchain early,
and marks perf_events are marked with __PERF_SAMPLE_CALLCHAIN_EARLY.
Similar issue exists when BPF program calls get_perf_callchain() via
helper functions. For more information about this issue, please refer to
discussions in [1].

This set fixes this issue with helper proto bpf_get_stackid_pe and
bpf_get_stack_pe.

[1] https://lore.kernel.org/lkml/ED7B9430-6489-4260-B3C5-9CFA2E3AA87A@fb.com/

Changes v4 => v5:
1. Return -EPROTO instead of -EINVAL on PERF_EVENT_IOC_SET_BPF errors.
   (Alexei)
2. Let libbpf print a hint message when PERF_EVENT_IOC_SET_BPF returns
   -EPROTO. (Alexei)

Changes v3 => v4:
1. Fix error check logic in bpf_get_stackid_pe and bpf_get_stack_pe.
   (Alexei)
2. Do not allow attaching BPF programs with bpf_get_stack|stackid to
   perf_event with precise_ip > 0, but not proper callchain. (Alexei)
3. Add selftest get_stackid_cannot_attach.

Changes v2 => v3:
1. Fix handling of stackmap skip field. (Andrii)
2. Simplify the code in a few places. (Andrii)

Changes v1 => v2:
1. Simplify the design and avoid introducing new helper function. (Andrii)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

90065c06

bpf: Make cgroup storages shared between programs on the same cgroup · 7d9c3427

YiFei Zhu authored Jul 23, 2020

This change comes in several parts:

One, the restriction that the CGROUP_STORAGE map can only be used
by one program is removed. This results in the removal of the field
'aux' in struct bpf_cgroup_storage_map, and removal of relevant
code associated with the field, and removal of now-noop functions
bpf_free_cgroup_storage and bpf_cgroup_storage_release.

Second, we permit a key of type u64 as the key to the map.
Providing such a key type indicates that the map should ignore
attach type when comparing map keys. However, for simplicity newly
linked storage will still have the attach type at link time in
its key struct. cgroup_storage_check_btf is adapted to accept
u64 as the type of the key.

Third, because the storages are now shared, the storages cannot
be unconditionally freed on program detach. There could be two
ways to solve this issue:
* A. Reference count the usage of the storages, and free when the
last program is detached.
* B. Free only when the storage is impossible to be referred to
again, i.e. when either the cgroup_bpf it is attached to, or
the map itself, is freed.
Option A has the side effect that, when the user detach and
reattach a program, whether the program gets a fresh storage
depends on whether there is another program attached using that
storage. This could trigger races if the user is multi-threaded,
and since nondeterminism in data races is evil, go with option B.

The both the map and the cgroup_bpf now tracks their associated
storages, and the storage unlink and free are removed from
cgroup_bpf_detach and added to cgroup_bpf_release and
cgroup_storage_map_free. The latter also new holds the cgroup_mutex
to prevent any races with the former.

Fourth, on attach, we reuse the old storage if the key already
exists in the map, via cgroup_storage_lookup. If the storage
does not exist yet, we create a new one, and publish it at the
last step in the attach process. This does not create a race
condition because for the whole attach the cgroup_mutex is held.
We keep track of an array of new storages that was allocated
and if the process fails only the new storages would get freed.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/d5401c6106728a00890401190db40020a1f84ff1.1595565795.git.zhuyifei@google.com

7d9c3427

selftests/bpf: Add get_stackid_cannot_attach · 346938e9

Song Liu authored Jul 23, 2020

This test confirms that BPF program that calls bpf_get_stackid() cannot
attach to perf_event with precise_ip > 0 but not PERF_SAMPLE_CALLCHAIN;
and cannot attach if the perf_event has exclude_callchain_kernel.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723180648.1429892-6-songliubraving@fb.com

346938e9

selftests/bpf: Test CGROUP_STORAGE map can't be used by multiple progs · 9e5bd1f7

YiFei Zhu authored Jul 23, 2020

The current assumption is that the lifetime of a cgroup storage
is tied to the program's attachment. The storage is created in
cgroup_bpf_attach, and released upon cgroup_bpf_detach and
cgroup_bpf_release.

Because the current semantics is that each attachment gets a
completely independent cgroup storage, and you can have multiple
programs attached to the same (cgroup, attach type) pair, the key
of the CGROUP_STORAGE map, looking up the map with this pair could
yield multiple storages, and that is not permitted. Therefore,
the kernel verifier checks that two programs cannot share the same
CGROUP_STORAGE map, even if they have different expected attach
types, considering that the actual attach type does not always
have to be equal to the expected attach type.

The test creates a CGROUP_STORAGE map and make it shared across
two different programs, one cgroup_skb/egress and one /ingress.
It asserts that the two programs cannot be both loaded, due to
verifier failure from the above reason.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/30a6b0da67ae6b0296c4d511bfb19c5f3d035916.1595565795.git.zhuyifei@google.com

9e5bd1f7

selftests/bpf: Add callchain_stackid · 1da4864c

Song Liu authored Jul 23, 2020

This tests new helper function bpf_get_stackid_pe and bpf_get_stack_pe.
These two helpers have different implementation for perf_event with PEB
entries.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200723180648.1429892-5-songliubraving@fb.com

1da4864c

selftests/bpf: Add test for CGROUP_STORAGE map on multiple attaches · d4a89c1e

YiFei Zhu authored Jul 23, 2020

This test creates a parent cgroup, and a child of that cgroup.
It attaches a cgroup_skb/egress program that simply counts packets,
to a global variable (ARRAY map), and to a CGROUP_STORAGE map.
The program is first attached to the parent cgroup only, then to
parent and child.

The test cases sends a message within the child cgroup, and because
the program is inherited across parent / child cgroups, it will
trigger the egress program for both the parent and child, if they
exist. The program, when looking up a CGROUP_STORAGE map, uses the
cgroup and attach type of the attachment parameters; therefore,
both attaches uses different cgroup storages.

We assert that all packet counts returns what we expects.
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/5a20206afa4606144691c7caa0d1b997cd60dec0.1595565795.git.zhuyifei@google.com

d4a89c1e

libbpf: Print hint when PERF_EVENT_IOC_SET_BPF returns -EPROTO · d4b4dd6c

Song Liu authored Jul 23, 2020

The kernel prevents potential unwinder warnings and crashes by blocking
BPF program with bpf_get_[stack|stackid] on perf_event without
PERF_SAMPLE_CALLCHAIN, or with exclude_callchain_[kernel|user]. Print a
hint message in libbpf to help the user debug such issues.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723180648.1429892-4-songliubraving@fb.com

d4b4dd6c

Merge branch 'bpf_iter-for-map-elems' · 909e446b

Alexei Starovoitov authored Jul 23, 2020

Yonghong Song says:

====================
Bpf iterator has been implemented for task, task_file,
bpf_map, ipv6_route, netlink, tcp and udp so far.

For map elements, there are two ways to traverse all elements from
user space:
  1. using BPF_MAP_GET_NEXT_KEY bpf subcommand to get elements
     one by one.
  2. using BPF_MAP_LOOKUP_BATCH bpf subcommand to get a batch of
     elements.
Both these approaches need to copy data from kernel to user space
in order to do inspection.

This patch implements bpf iterator for map elements.
User can have a bpf program in kernel to run with each map element,
do checking, filtering, aggregation, modifying values etc.
without copying data to user space.

Patch #1 and #2 are refactoring. Patch #3 implements readonly/readwrite
buffer support in verifier. Patches #4 - #7 implements map element
support for hash, percpu hash, lru hash lru percpu hash, array,
percpu array and sock local storage maps. Patches #8 - #9 are libbpf
and bpftool support. Patches #10 - #13 are selftests for implemented
map element iterators.

Changelogs:
  v3 -> v4:
    . fix a kasan failure triggered by a failed bpf_iter link_create,
      not just free_link but need cleanup_link. (Alexei)
  v2 -> v3:
    . rebase on top of latest bpf-next
  v1 -> v2:
    . support to modify map element values. (Alexei)
    . map key/values can be used with helper arguments
      for those arguments with ARG_PTR_TO_MEM or
      ARG_PTR_TO_INIT_MEM register type. (Alexei)
    . remove usused variable. (kernel test robot)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

909e446b