- 21 Apr, 2023 14 commits
-
-
Florian Westphal authored
Extend prog_tests with two test cases:

  # ./test_progs --allow=verifier_netfilter_retcode
  #278/1 verifier_netfilter_retcode/bpf_exit with invalid return code. test1:OK
  #278/2 verifier_netfilter_retcode/bpf_exit with valid return code. test2:OK
  #278/3 verifier_netfilter_retcode/bpf_exit with valid return code. test3:OK
  #278/4 verifier_netfilter_retcode/bpf_exit with invalid return code. test4:OK
  #278 verifier_netfilter_retcode:OK

This checks that only accept and drop (0,1) are permitted.

NF_QUEUE could be implemented later if we can guarantee that attachment of such programs can be rejected if they get attached to a pf/hook that doesn't support async reinjection.

NF_STOLEN could be implemented via trusted helpers that can guarantee that the skb will eventually be free'd.

v4: test case for bpf_nf_ctx access checks, requested by Alexei Starovoitov.
v5: also check ctx->{state,skb} can be dereferenced (Alexei).

  # ./test_progs --allow=verifier_netfilter_ctx
  #281/1 verifier_netfilter_ctx/netfilter invalid context access, size too short:OK
  #281/2 verifier_netfilter_ctx/netfilter invalid context access, size too short:OK
  #281/3 verifier_netfilter_ctx/netfilter invalid context access, past end of ctx:OK
  #281/4 verifier_netfilter_ctx/netfilter invalid context, write:OK
  #281/5 verifier_netfilter_ctx/netfilter valid context read and invalid write:OK
  #281/6 verifier_netfilter_ctx/netfilter test prog with skb and state read access:OK
  #281/7 verifier_netfilter_ctx/netfilter test prog with skb and state read access @unpriv:OK
  #281 verifier_netfilter_ctx:OK
  Summary: 1/7 PASSED, 0 SKIPPED, 0 FAILED

This checks:
1/2: partial reads of ctx->{skb,state} are rejected
3: read access past sizeof(ctx) is rejected
4: write to ctx content, e.g. 'ctx->skb = NULL;', is rejected
5: ctx->state content cannot be altered
6: ctx->state and ctx->skb can be dereferenced
7: ... same program fails for unpriv (CAP_NET_ADMIN needed).

Link: https://lore.kernel.org/bpf/20230419021152.sjq4gttphzzy6b5f@dhcp-172-26-102-232.dhcp.thefacebook.com/
Link: https://lore.kernel.org/bpf/20230420201655.77kkgi3dh7fesoll@MacBook-Pro-6.local/
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-8-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
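A minimal sketch of the kind of program these checks cover is below; the SEC("netfilter") section name and the bpf_nf_ctx field accesses follow the description above, but the program itself is illustrative and not part of this series.

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  /* verdict values; only these two are accepted by the verifier */
  #define NF_DROP   0
  #define NF_ACCEPT 1

  SEC("netfilter")
  int nf_drop_jumbo(struct bpf_nf_ctx *ctx)
  {
          const struct nf_hook_state *state = ctx->state; /* full-width reads are allowed */
          struct sk_buff *skb = ctx->skb;

          /* drop oversized packets arriving on the input hook, accept the rest */
          if (state->hook == NF_INET_LOCAL_IN && skb->len > 1500)
                  return NF_DROP;

          return NF_ACCEPT;
  }

  char _license[] SEC("license") = "GPL";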
-
Florian Westphal authored
Add glue code so a bpf program can be run using userspace-provided netfilter state and packet/skb.

The default is to use the ipv4:output hook point, but this can be overridden by userspace. The userspace-provided netfilter state is restricted: only the hook and protocol family can be overridden, and only to ipv4/ipv6.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-7-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Florian Westphal authored
Dump protocol family, hook and priority value:

  $ bpftool link
  2: netfilter prog 14 ip input prio -128 pids install(3264)
  5: netfilter prog 14 ip6 forward prio 21 pids a.out(3387)
  9: netfilter prog 14 ip prerouting prio 123 pids a.out(5700)
  10: netfilter prog 14 ip input prio 21 pids test2(5701)

v2: Quentin Monnet suggested to also add 'bpftool net' support:

  $ bpftool net
  xdp:
  tc:
  flow_dissector:
  netfilter:
          ip prerouting prio 21 prog_id 14
          ip input prio -128 prog_id 14
          ip input prio 21 prog_id 14
          ip forward prio 21 prog_id 14
          ip output prio 21 prog_id 14
          ip postrouting prio 21 prog_id 14

'bpftool net' only dumps the netfilter link type; links are sorted by protocol family, hook and priority.

v5: fix bpf ci failure: libbpf needs a small update to prog_type_name[] and the probe_prog_load helper.
v4: don't fail with -EOPNOTSUPP in libbpf probe_prog_load, update prog_type_name[] with a "netfilter" entry (bpf ci)
v3: fix bpf.h copy, 'reserved' member was removed (Alexei); use p_err, not fprintf (Quentin)

Suggested-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/eeeaac99-9053-90c2-aa33-cc1ecb1ae9ca@isovalent.com/
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-6-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Florian Westphal authored
This is just to avoid ordering issues between multiple bpf programs; it could be removed later in case it turns out to be too cautious.

A bpf prog can still share a priority with a non-bpf hook, otherwise we'd have to make conntrack hook registration fail just because a bpf program has the same priority.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-5-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Florian Westphal authored
This allows userspace ("nft list hooks") to show which bpf program is attached to which hook. Without this, the user only knows that a bpf prog is attached at prio x, y, z at INPUT and FORWARD, but can't tell which program is where.

v4: kdoc fixups (Simon Horman)
Link: https://lore.kernel.org/bpf/ZEELzpNCnYJuZyod@corigine.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-4-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Florian Westphal authored
This adds minimal support for BPF_PROG_TYPE_NETFILTER bpf programs that will be invoked via the NF_HOOK() points in the ip stack.

Invocation incurs an indirect call. This is not a necessity: it's possible to add 'DEFINE_BPF_DISPATCHER(nf_progs)' and handle the program invocation with the same method already used for xdp progs. This isn't done here to keep the size of this chunk down.

The verifier restricts verdicts to either DROP or ACCEPT.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-3-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Florian Westphal authored
Add bpf_link support skeleton. To keep this reviewable, no bpf program can be invoked yet: if a program is attached, only a C stub is called, not the actual bpf program.

Defaults to 'y' if both netfilter and the bpf syscall are enabled in kconfig.

Uapi example usage:

  union bpf_attr attr = { };

  attr.link_create.prog_fd = progfd;
  attr.link_create.attach_type = 0; /* unused */
  attr.link_create.netfilter.pf = PF_INET;
  attr.link_create.netfilter.hooknum = NF_INET_LOCAL_IN;
  attr.link_create.netfilter.priority = -128;

  err = bpf(BPF_LINK_CREATE, &attr, sizeof(attr));

... this would attach progfd to the ipv4:input hook. Such a hook gets removed automatically if the calling program exits.

BPF_NETFILTER program invocation is added in a followup change.

The NF_HOOK_OP_BPF enum will eventually be read from nfnetlink_hook; it allows to tell userspace which program is attached at the given hook when the user runs the 'nft hook list' command, rather than just the priority and a not-very-helpful 'this hook runs a bpf prog but I can't tell which one'. It will also be used to disallow registration of two bpf programs with the same priority in a followup patch.

v4: arm32 cmpxchg only supports 32bit operand
    s/prio/priority/
v3: restrict prog attachment to ip/ip6 for now, let's lift restrictions if more use cases pop up (arptables, ebtables, netdev ingress/egress etc).

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-2-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Kui-Feng Lee authored
The "struct_ops register" subcommand now allows for an optional *LINK_DIR* to be included. This specifies the directory path where bpftool will pin struct_ops links with the same name as their corresponding map names.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/r/20230420002822.345222-2-kuifeng@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Kui-Feng Lee authored
You can include an optional path after specifying the object name for the 'struct_ops register' subcommand.

Since commit 226bc6ae ("Merge branch 'Transit between BPF TCP congestion controls.'") was accepted, it is now possible to create a link for a struct_ops. This can be done by defining a struct_ops in SEC(".struct_ops.link") to make libbpf return a real link. If we don't pin the links before leaving bpftool, they will disappear. To instruct bpftool to pin the links in a directory with the names of the maps, we need to provide the path of that directory.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/r/20230420002822.345222-1-kuifeng@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
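For illustration, a sketch of what such an object might contain; the congestion-control callbacks, names and pin path below are assumptions, not something this commit ships.

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  /* placeholder callbacks, just enough to show the section placement */
  SEC("struct_ops/dummy_ssthresh")
  __u32 BPF_PROG(dummy_ssthresh, struct sock *sk)
  {
          return 16;
  }

  SEC("struct_ops/dummy_cong_avoid")
  void BPF_PROG(dummy_cong_avoid, struct sock *sk, __u32 ack, __u32 acked)
  {
  }

  SEC("struct_ops/dummy_undo_cwnd")
  __u32 BPF_PROG(dummy_undo_cwnd, struct sock *sk)
  {
          return 2;
  }

  /* placing the map in ".struct_ops.link" makes libbpf create a bpf_link for it */
  SEC(".struct_ops.link")
  struct tcp_congestion_ops dummy_cc = {
          .ssthresh   = (void *)dummy_ssthresh,
          .cong_avoid = (void *)dummy_cong_avoid,
          .undo_cwnd  = (void *)dummy_undo_cwnd,
          .name       = "bpf_dummy_cc",
  };

With an object like this, 'bpftool struct_ops register dummy_cc.o /sys/fs/bpf/cc_links' (path chosen arbitrarily) would pin the created link under that directory using the map name, so it survives bpftool exiting.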
-
Stanislav Fomichev authored
Make sure we get optlen exported instead of getting EFAULT.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230418225343.553806-3-sdf@google.com
-
Stanislav Fomichev authored
Some socket options do getsockopt with optval=NULL to estimate the size of the final buffer (which is returned via optlen). This breaks BPF getsockopt assumptions about permitted optval buffer size. Let's enforce these assumptions only when non-NULL optval is provided.

Fixes: 0d01da6a ("bpf: implement getsockopt and setsockopt hooks")
Reported-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a9244a15d01d00c3f0e5b08f49f7
Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com
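A userspace sketch of the NULL-optval pattern described above; the use of SOL_NETLINK/NETLINK_LIST_MEMBERSHIPS as the probed option is an assumption for illustration, not taken from the commit.

  #include <stdio.h>
  #include <sys/socket.h>
  #include <linux/netlink.h>

  /* ask the kernel how large the membership list is without passing a buffer */
  static int probe_membership_len(int nl_fd)
  {
          socklen_t optlen = 0;

          /* optval == NULL: the kernel only reports the required length via optlen */
          if (getsockopt(nl_fd, SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS, NULL, &optlen))
                  return -1;

          printf("need %u bytes for the membership list\n", optlen);
          return (int)optlen;
  }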
-
Magnus Karlsson authored
Put the flag MAP_HUGE_2MB in the correct flags argument instead of the wrong offset argument.

Fixes: 2ddade32 ("selftests/xsk: Fix munmap for hugepage allocated umem")
Reported-by: Kal Cutter Conley <kal.conley@dectris.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230421062208.3772-1-magnus.karlsson@gmail.com
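A sketch of the corrected call shape: the hugepage size selector belongs in the mmap() flags argument, not in the file offset. The buffer name and size are illustrative.

  #include <sys/mman.h>
  #include <linux/mman.h> /* MAP_HUGE_2MB */

  #define UMEM_SIZE (4UL << 20)

  static void *alloc_hugepage_umem(void)
  {
          /* hugepage selector goes into the flags argument ... */
          return mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                      -1, 0 /* ... not into the offset argument */);
  }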
-
Dave Marchevsky authored
When calculating the address of the refcount_t struct within a local kptr, bpf_refcount_acquire_impl should add refcount_off bytes to the address of the local kptr. Due to some missing parens, the function is incorrectly adding sizeof(refcount_t) * refcount_off bytes. This patch fixes the calculation.

Due to the incorrect calculation, bpf_refcount_acquire_impl was trying to refcount_inc some memory well past the end of local kptrs, resulting in kasan and refcount complaints, as reported in [0]. In that thread, Florian and Eduard discovered that bpf selftests written in the new style - with __success and an expected __retval, specifically - were not actually being run. As a result, selftests added in the bpf_refcount series weren't really exercising this behavior, and thus didn't unearth the bug.

With this fixed behavior it's safe to revert commit 7c4b96c0 ("selftests/bpf: disable program test run for progs/refcounted_kptr.c"); this patch does so.

[0] https://lore.kernel.org/bpf/ZEEp+j22imoN6rn9@strlen.de/

Fixes: 7c50b1cb ("bpf: Add bpf_refcount_acquire kfunc")
Reported-by: Florian Westphal <fw@strlen.de>
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20230421074431.3548349-1-davemarchevsky@fb.com
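A small illustration of the parenthesization issue, with made-up names (this is not the kernel's code):

  #include <linux/types.h>
  #include <linux/refcount.h>

  /* return the refcount_t embedded 'refcount_off' bytes into the object */
  static refcount_t *node_refcount(void *kptr, u32 refcount_off)
  {
          /* buggy form: '(refcount_t *)kptr + refcount_off' advances the pointer
           * by refcount_off * sizeof(refcount_t) bytes because of pointer scaling */
          return (refcount_t *)((char *)kptr + refcount_off); /* advance by bytes */
  }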
-
Alexei Starovoitov authored
Florian and Eduard reported a hard deadlock:

  [   58.433327]  _raw_spin_lock_irqsave+0x40/0x50
  [   58.433334]  btf_put+0x43/0x90
  [   58.433338]  bpf_find_btf_id+0x157/0x240
  [   58.433353]  btf_parse_fields+0x921/0x11c0

This happens since btf->refcount can be 1 at the time of btf_put() and btf_put() will call btf_free_id() which will try to grab btf_idr_lock and will deadlock. Avoid the issue by doing btf_put() without locking.

Fixes: 3d78417b ("bpf: Add bpf_btf_find_by_name_kind() helper.")
Fixes: 1e89106d ("bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn().")
Reported-by: Florian Westphal <fw@strlen.de>
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20230421014901.70908-1-alexei.starovoitov@gmail.com
-
- 20 Apr, 2023 9 commits
-
-
Alexei Starovoitov authored
Eduard Zingerman says:

====================

Florian Westphal found a bug in test_loader.c processing of the __retval tag. Because of this bug the function test_loader.c:do_prog_test_run() never executed and all __retval test tags were ignored. See [1].

The fix for this bug uncovers two additional bugs:
- during the test_verifier tests migration to inline assembly (see [2]) I missed the fact that some tests require maps to contain mock values;
- some issue with a new refcounted_kptr test, which causes the kernel to produce deadlock and refcount saturation warnings when subjected to libbpf's bpf_test_run_opts().

This series fixes the bug in __retval() processing and addresses the issue with test maps not being populated. The issue in refcounted_kptr is not addressed; __retval tags in those tests are commented out.

I found that the following tests depend on test maps being populated:
- progs/verifier_array_access.c
- verifier/value_ptr_arith.c (planned for migration to inline assembly)

Given the small number of these tests I decided to opt for a simple non-generic solution (see patch #4).

[1] https://lore.kernel.org/bpf/f4c4aee644425842ee6aa8edf1da68f0a8260e7c.camel@gmail.com/T/
[2] https://lore.kernel.org/bpf/20230325025524.144043-1-eddyz87@gmail.com/

====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
Two test cases:
- "valid read map access into a read-only array 1" and
- "valid read map access into a read-only array 2"

expect that the map_array_ro map is filled with mock data. This logic was not taken into account during the initial test conversion. This commit modifies the prog_tests/verifier.c entry point for this test to fill the map.

Fixes: a3c830ae ("selftests/bpf: verifier/array_access.c converted to inline assembly")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230420232317.2181776-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
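A rough sketch of the kind of pre-population this describes; the map name comes from the message above, while the libbpf entry point, value layout and values are assumptions.

  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>

  static int fill_map_array_ro(struct bpf_object *obj)
  {
          int fd = bpf_object__find_map_fd_by_name(obj, "map_array_ro");
          __u32 key = 0;
          __u64 mock[8] = { 1, 2, 3, 4, 5, 6, 7, 8 }; /* mock data for the test to read */

          if (fd < 0)
                  return fd;
          /* BPF_F_RDONLY_PROG maps are read-only for programs but writable from here */
          return bpf_map_update_elem(fd, &key, mock, BPF_ANY);
  }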
-
Eduard Zingerman authored
When a test case is annotated with the __retval tag, the test_loader engine uses libbpf's bpf_prog_test_run_opts() to do a test run of the program and compare retvals. This commit makes it possible to perform arbitrary actions on the bpf object right before the test loader invokes bpf_prog_test_run_opts(). This could be used to set up some state for program execution, e.g. fill some maps.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230420232317.2181776-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
Florian Westphal found a bug in, and suggested a fix for, test_loader.c's processing of the __retval tag. Because of this bug the function test_loader.c:do_prog_test_run() never executed and all __retval test tags were ignored.

If this bug is fixed, a number of test cases from progs/verifier_array_access.c fail with a retval not matching the expected value. This test was recently converted to use test_loader.c and inline assembly in [1]. When doing the conversion I missed an important detail of test_verifier.c operation: when it creates fixup_map_array_ro, fixup_map_array_wo and fixup_map_array_small it populates these maps with a dummy record.

Disable the __retval checks for the affected verifier_array_access tests in this commit to avoid false positives in any potential bisects. The issue is addressed in the next patch.

I verified that the __retval tags are now respected by changing expected return values for all tests annotated with __retval, and checking that these tests started to fail.

[1] https://lore.kernel.org/bpf/20230325025524.144043-1-eddyz87@gmail.com/

Fixes: 19a8e06f ("selftests/bpf: Tests execution support for test_loader.c")
Reported-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/bpf/f4c4aee644425842ee6aa8edf1da68f0a8260e7c.camel@gmail.com/T/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230420232317.2181776-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
Florian Westphal found a bug in test_loader.c processing of the __retval tag. Because of this bug the function test_loader.c:do_prog_test_run() never executed and all __retval test tags were ignored. This hid an issue with the progs/refcounted_kptr.c tests.

When the __retval tag bug is fixed and the refcounted_kptr.c tests are run, the kernel reports various issues and eventually hangs. The shortest reproducer is the following command run a few times:

  $ for i in $(seq 1 4); do (./test_progs --allow=refcounted_kptr &); done

Comment out the __retval tags for these tests until this issue is resolved.

Reported-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/bpf/f4c4aee644425842ee6aa8edf1da68f0a8260e7c.camel@gmail.com/T/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230420232317.2181776-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Quentin Monnet authored
The recent support for inline annotations in control flow graphs generated by bpftool introduced the usage of the "__fallthrough" macro in a switch/case block in btf_dumper.c. This change went through the bpf-next tree, but resulted in a merge conflict in linux-next, because this macro has been renamed "fallthrough" (no underscores) in the meantime.

To address the conflict, we temporarily switch to a simple comment instead of a macro.

Related: commit f7a858bf ("tools: Rename __fallthrough to fallthrough")
Fixes: 9fd49684 ("bpftool: Support inline annotations when dumping the CFG of a program")
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Reported-by: Thomas Richter <tmricht@linux.ibm.com>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/all/yt9dttxlwal7.fsf@linux.ibm.com/
Link: https://lore.kernel.org/bpf/20230412123636.2358949-1-tmricht@linux.ibm.com/
Link: https://lore.kernel.org/bpf/20230420003333.90901-1-quentin@isovalent.com
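A minimal illustration of the workaround; the enum and the work in each case are made up, only the comment-in-place-of-macro is the point.

  enum kind { KIND_A, KIND_B };

  static void dispatch(enum kind k)
  {
          switch (k) {
          case KIND_A:
                  /* ... KIND_A-specific work ... */
                  /* fallthrough */  /* plain comment instead of the __fallthrough macro */
          case KIND_B:
                  /* ... work shared with KIND_B ... */
                  break;
          }
  }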
-
Alexei Starovoitov authored
Feng Zhou says:

====================

From: Feng Zhou <zhoufeng.zf@bytedance.com>

Add support for accessing variable length arrays with an integer element type. Add a selftest to check it.

====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Feng Zhou authored
Add a prog test for accessing an integer-typed variable array in a tracing program.

In addition, hook the load_balance function to access sd->span[0], only to confirm whether the load is successful, because there is no direct way to trigger a load_balance call.

Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Link: https://lore.kernel.org/r/20230420032735.27760-3-zhoufeng.zf@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Feng Zhou authored
After commit 9c5f8a10 ("bpf: Support variable length array in tracing programs"), trace programs can access a variable length array, but only one with a structure element type. This patch adds support for integer element types.

Example: hook load_balance

  struct sched_domain {
          ...
          unsigned long span[];
  };

The access: sd->span[0].

Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Link: https://lore.kernel.org/r/20230420032735.27760-2-zhoufeng.zf@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
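A sketch mirroring the example; the attach point and variable names are assumptions modelled on the selftest mentioned in the neighboring commit.

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  unsigned long first_span_word;

  SEC("fentry/load_balance")
  int BPF_PROG(read_sd_span, int this_cpu, struct rq *this_rq,
               struct sched_domain *sd)
  {
          /* integer-typed flexible array member, readable with this change */
          first_span_word = sd->span[0];
          return 0;
  }

  char _license[] SEC("license") = "GPL";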
-
- 19 Apr, 2023 1 commit
-
-
Magnus Karlsson authored
Fix the unmapping of hugepage-allocated umems so that they are properly unmapped. The new test referred to in the Fixes tag introduced a umem that is not a multiple of the 2M hugepage size. This is fine for mmap(), which rounds the size up to the nearest multiple of 2M. But munmap() requires the size to be a multiple of the hugepage size in order for it to unmap the region.

The current behaviour of not properly unmapping the umem was discovered when further additions of tests that require hugepages (unaligned mode tests only) started failing as the system was running out of hugepages.

Fixes: c0801598 ("selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230418143617.27762-1-magnus.karlsson@gmail.com
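The idea of the fix as a sketch with illustrative names: round the length up to the hugepage size before munmap(), since mmap() rounds up internally but munmap() does not.

  #include <stddef.h>
  #include <sys/mman.h>

  #define HUGEPAGE_SIZE (2UL * 1024 * 1024)

  static int free_umem_buffer(void *buf, size_t len, int is_hugepage)
  {
          if (is_hugepage)
                  len = (len + HUGEPAGE_SIZE - 1) & ~(HUGEPAGE_SIZE - 1);
          return munmap(buf, len);
  }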
-
- 18 Apr, 2023 7 commits
-
-
Alexei Starovoitov authored
Andrii Nakryiko says:

====================

This patch set moves the bpf_for(), bpf_for_each(), and bpf_repeat() macros from the selftests-internal bpf_misc.h header to the libbpf-provided bpf_helpers.h header. To do this in a way that allows users to feature-detect and guard such bpf_for()/bpf_for_each() uses on old kernels, we also extend libbpf to improve unresolved kfunc call handling and reporting. This lets us mark the bpf_iter_num_{new,next,destroy}() declarations as __weak, and thus not fail program loading outright if such kfuncs are missing on the host kernel.

Patches #1 and #2 do some simple clean ups and logging improvements. Patch #3 adds kfunc call poisoning and log fixup logic and is effectively the heart of this patch set. Patch #4 adds a selftest for this logic. Patches #4 and #5 move bpf_for()/bpf_for_each()/bpf_repeat() into the bpf_helpers.h header and mark the kfuncs as __weak to allow users to feature-detect and guard their uses.

====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Mark bpf_iter_num_{new,next,destroy}() kfuncs declared for bpf_for()/bpf_repeat() macros as __weak to allow users to feature-detect their presence and guard bpf_for()/bpf_repeat() loops accordingly for backwards compatibility with old kernels.

Now that libbpf supports kfunc calls poisoning and better reporting of unresolved (but called) kfuncs, declaring number iterator kfuncs in bpf_helpers.h won't degrade user experience and won't cause unnecessary kernel feature dependencies.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230418002148.3255690-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
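A sketch of the feature-detection pattern this enables; the section name and program body are placeholders. bpf_ksym_exists() tests whether the __weak bpf_iter_num_new kfunc was resolved on the running kernel.

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  int total;

  SEC("tc")
  int sum_first_ten(struct __sk_buff *skb)
  {
          int i;

          if (!bpf_ksym_exists(bpf_iter_num_new))
                  return 0; /* old kernel: the guarded loop below is dead code */

          bpf_for(i, 0, 10) /* expands to bpf_iter_num_{new,next,destroy}() calls */
                  total += i;

          return 0;
  }

  char _license[] SEC("license") = "GPL";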
-
Andrii Nakryiko authored
To make it easier for bleeding-edge BPF applications, such as sched_ext, to utilize open-coded iterators, move bpf_for(), bpf_for_each(), and bpf_repeat() macros from the selftests/bpf-internal bpf_misc.h helper to the libbpf-provided bpf_helpers.h header.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230418002148.3255690-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Add test validating that libbpf correctly poisons and reports __weak unresolved kfuncs in the post-processed verifier log.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230418002148.3255690-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Currently, libbpf leaves a `call #0` instruction for __weak unresolved kfuncs, which might lead to a confusing verifier log situation, where an invalid `call #0` will be treated as successfully validated.

We can do better. Libbpf already has an established mechanism of poisoning instructions that failed some form of resolution (e.g., CO-RE relocation and a BPF map set to not be auto-created). Libbpf doesn't fail them outright to allow users to guard them through other means, and as long as the BPF verifier can prove that such poisoned instructions cannot ever be reached, this doesn't constitute an invalid BPF program. If the user didn't guard such code, libbpf will extract a few pieces of information to tie such poisoned instructions back to additional information about what entity wasn't resolved (e.g., BPF map name, or CO-RE relocation information).

__weak unresolved kfuncs fit this model well, so this patch extends libbpf with poisoning and log fixup logic for kfunc calls.

Note, this poisoning is done only for kfunc *calls*, not kfunc address resolution (ldimm64 instructions). The former cannot ever be valid, if reached, so it's safe to poison them. The latter is a valid mechanism to check if a __weak kfunc ksym was resolved, and do necessary guarding and work arounds based on this result, supported in most recent kernels. As such, libbpf keeps such ldimm64 instructions as loading zero, never poisoning them.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230418002148.3255690-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Currently libbpf always reports "kernel" as the source of a ksym BTF type, which is ambiguous given that a ksym's BTF can come from either vmlinux or kernel module BTFs. Make this explicit and log the module name if the used BTF is from a kernel module.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230418002148.3255690-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Normalize internal constants, field names, and comments related to log fixup. Also add an explicit `ext_idx` alias for relocations that point to an extern description for additional information.

No functional changes, just a clean up before subsequent additions.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230418002148.3255690-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 17 Apr, 2023 4 commits
-
-
Yonghong Song authored
Add a selftest to ensure subreg equality if the source register's upper 32 bits are 0. Without the previous patch, the test will fail verification.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230417222139.360607-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
In [1], I tried to remove bpf-specific codes to prevent certain llvm optimizations, and add llvm TTI (target transform info) hooks to prevent those optimizations. During this process, I found that if I enable the llvm SimplifyCFG:shouldFoldTwoEntryPHINode transformation, I will hit the following verification failure with selftests:

  ...
  8: (18) r1 = 0xffffc900001b2230       ; R1_w=map_value(off=560,ks=4,vs=564,imm=0)
  10: (61) r1 = *(u32 *)(r1 +0)         ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
  ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
  11: (79) r2 = *(u64 *)(r6 +152)       ; R2_w=scalar() R6=ctx(off=0,imm=0)
  ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
  12: (55) if r2 != 0xb9fbeef goto pc+10        ; R2_w=195018479
  13: (bc) w2 = w1                      ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
  ; if (test < __NR_TESTS)
  14: (a6) if w1 < 0x9 goto pc+1
  16: R0=2 R1_w=scalar(umax=8,var_off=(0x0; 0xf)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6=ctx(off=0,imm=0) R10=fp0
  ;
  16: (27) r2 *= 28                     ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4)
  17: (18) r3 = 0xffffc900001b2118      ; R3_w=map_value(off=280,ks=4,vs=564,imm=0)
  19: (0f) r3 += r2                     ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) R3_w=map_value(off=280,ks=4,vs=564,umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4)
  20: (61) r2 = *(u32 *)(r3 +0)
  R3 unbounded memory access, make sure to bounds check any such access
  processed 97 insns (limit 1000000) max_states_per_insn 1 total_states 10 peak_states 10 mark_read 6
  -- END PROG LOAD LOG --
  libbpf: prog 'ingress_fwdns_prio100': failed to load: -13
  libbpf: failed to load object 'test_tc_dtime'
  libbpf: failed to load BPF skeleton 'test_tc_dtime': -13
  ...

At insn 14, with condition 'w1 < 9', register r1 is changed from an arbitrary u32 value to `scalar(umax=8,var_off=(0x0; 0xf))`. Register r2, however, remains an arbitrary u32 value. The current verifier won't claim r1/r2 equality if the previous mov is alu32 ('w2 = w1'). If r1's upper 32-bit value is not 0, we indeed cannot claim r1/r2 equality after 'w2 = w1'. But in this particular case, we know r1's upper 32-bit value is 0, so it is safe to claim r1/r2 equality.

This patch does exactly that: for a 32-bit subreg mov, if the src register's upper 32 bits are 0, it is okay to claim equality between src and dst registers. With this patch, the above verification sequence becomes:

  ...
  8: (18) r1 = 0xffffc9000048e230       ; R1_w=map_value(off=560,ks=4,vs=564,imm=0)
  10: (61) r1 = *(u32 *)(r1 +0)         ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
  ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
  11: (79) r2 = *(u64 *)(r6 +152)       ; R2_w=scalar() R6=ctx(off=0,imm=0)
  ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
  12: (55) if r2 != 0xb9fbeef goto pc+10        ; R2_w=195018479
  13: (bc) w2 = w1                      ; R1_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff))
  ; if (test < __NR_TESTS)
  14: (a6) if w1 < 0x9 goto pc+1        ; R1_w=scalar(id=6,umin=9,umax=4294967295,var_off=(0x0; 0xffffffff))
  ...
  from 14 to 16: R0=2 R1_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R2_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R6=ctx(off=0,imm=0) R10=fp0
  16: (27) r2 *= 28                     ; R2_w=scalar(umax=224,var_off=(0x0; 0xfc))
  17: (18) r3 = 0xffffc9000048e118      ; R3_w=map_value(off=280,ks=4,vs=564,imm=0)
  19: (0f) r3 += r2
  20: (61) r2 = *(u32 *)(r3 +0)         ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R3_w=map_value(off=280,ks=4,vs=564,umax=224,var_off=(0x0; 0xfc),s32_max=252,u32_max=252)
  ...

and eventually the bpf program can be verified successfully.

[1] https://reviews.llvm.org/D147968

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230417222134.359714-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
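A C-level sketch of the pattern from the log above; the struct and field names are made up for illustration and are not the selftest's code.

  struct cfg {
          __u32 test;
          __u32 results[9];
  };

  static __u32 pick_result(struct cfg *c) /* c points into a map value */
  {
          __u32 test = c->test; /* u32 load: upper 32 bits of the register are 0 */
          __u32 idx = test;     /* alu32 move (w2 = w1): both registers now share an id */

          if (test < 9)                    /* bounding 'test' now also bounds 'idx' ... */
                  return c->results[idx];  /* ... so this access is provably in range */
          return 0;
  }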
-
Sean Young authored
Make it possible to load the lirc program type with just CAP_BPF. There is nothing exceptional about lirc programs that means they require CAP_SYS_ADMIN.

In order to attach or detach a lirc program type you need permission to open /dev/lirc0; if you have permission to do that, you can alter all sorts of lirc receiving options. Changing the IR protocol decoder is no different.

Right now on a typical distribution /dev/lirc devices are only read/write by root. Ideally we would make them group read/write like other devices so that local users can use them without becoming root.

Signed-off-by: Sean Young <sean@mess.org>
Link: https://lore.kernel.org/r/ZD0ArKpwnDBJZsrE@gofer.mess.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Daniel Borkmann authored
There are some use-cases where it is desirable to use bpf_redirect() in combination with an ifb device, which currently is not supported, for example, around filtering inbound traffic with BPF to then push it to ifb, which holds the qdisc for shaping, in contrast to doing that on the egress device.

Toke mentions the following case related to OpenWrt:

  Because there's not always a single egress on the other side. These are mainly home routers, which tend to have one or more WiFi devices bridged to one or more ethernet ports on the LAN side, and a single upstream WAN port. And the objective is to control the total amount of traffic going over the WAN link (in both directions), to deal with bufferbloat in the ISP network (which is sadly still all too prevalent).

  In this setup, the traffic can be split arbitrarily between the links on the LAN side, and the only "single bottleneck" is the WAN link. So we install both egress and ingress shapers on this, configured to something like 95-98% of the true link bandwidth, thus moving the queues into the qdisc layer in the router. It's usually necessary to set the ingress bandwidth shaper a bit lower than the egress due to being "downstream" of the bottleneck link, but it does work surprisingly well.

  We usually use something like a matchall filter to put all ingress traffic on the ifb, so doing the redirect from BPF has not been an immediate requirement thus far. However, it does seem a bit odd that this is not possible, and we do have a BPF-based filter that layers on top of this kind of setup, which currently uses u32 as the ingress filter and so it could presumably be improved to use BPF instead if that was available.

Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reported-by: Yafang Shao <laoar.shao@gmail.com>
Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://git.openwrt.org/?p=project/qosify.git;a=blob;f=README
Link: https://lore.kernel.org/bpf/875y9yzbuy.fsf@toke.dk
Link: https://lore.kernel.org/r/8cebc8b2b6e967e10cbafe2ffd6795050e74accd.1681739137.git.daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
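As a sketch of what this change enables (the device lookup and section name are assumptions), a tc ingress program on the WAN device could hand packets to the ifb for shaping:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  const volatile int ifb_ifindex; /* set by the loader to the ifb device's ifindex */

  SEC("tc")
  int shape_ingress(struct __sk_buff *skb)
  {
          /* push the packet to the ifb device, whose qdisc does the shaping */
          return bpf_redirect(ifb_ifindex, 0);
  }

  char _license[] SEC("license") = "GPL";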
-
- 16 Apr, 2023 5 commits
-
-
Alexei Starovoitov authored
David Vernet says:

====================

We've managed to improve the UX for kptrs significantly over the last 9 months. All of the existing use cases which previously had KF_KPTR_GET kfuncs (struct bpf_cpumask *, struct task_struct *, and struct cgroup *) have all been updated to be synchronized using RCU. In other words, their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU | KF_ACQUIRE kfuncs, with the pointers themselves also being readable from maps in an RCU read region thanks to the types being RCU safe.

While KF_KPTR_GET was a logical starting point for kptrs, it's become clear that they're not the correct abstraction. KF_KPTR_GET is a flag that essentially does nothing other than enforcing that the argument to a function is a pointer to a referenced kptr map value. At first glance, that's a useful thing to guarantee to a kfunc. It gives kfuncs the ability to try and acquire a reference on that kptr without requiring the BPF prog to do something like this:

  struct kptr_type *in_map, *new = NULL;

  in_map = bpf_kptr_xchg(&map->value, NULL);
  if (in_map) {
          new = bpf_kptr_type_acquire(in_map);
          in_map = bpf_kptr_xchg(&map->value, in_map);
          if (in_map)
                  bpf_kptr_type_release(in_map);
  }

That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is the only alternative, it's better than nothing. However, the problem with any KF_KPTR_GET kfunc lies in the fact that it always requires some kind of synchronization in order to safely do an opportunistic acquire of the kptr in the map. This is because a BPF program running on another CPU could do a bpf_kptr_xchg() on that map value, and free the kptr after it's been read by the KF_KPTR_GET kfunc. For example, the now-removed bpf_task_kptr_get() kfunc did the following:

  struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
  {
          struct task_struct *p;

          rcu_read_lock();
          p = READ_ONCE(*pp);

          /* If p is non-NULL, it could still be freed by another CPU,
           * so we have to do an opportunistic refcount_inc_not_zero()
           * and return NULL if the task will be freed after the
           * current RCU read region.
           */
          if (p && !refcount_inc_not_zero(&p->rcu_users))
                  p = NULL;
          rcu_read_unlock();

          return p;
  }

In other words, the kfunc uses RCU to ensure that the task remains valid after it's been peeked from the map. However, this is completely redundant with just defining a KF_RCU kfunc that itself does a refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now does.

So, the question of whether KF_KPTR_GET is useful is actually, "Are there any synchronization mechanisms / safety flags that are required by certain kptrs, but which are not provided by the verifier to kfuncs?" The answer to that question today is "No", because every kptr we currently care about is RCU protected.

Even if the answer ever became "yes", the proper way to support that referenced kptr type would be to add support for whatever synchronization mechanism it requires in the verifier, rather than giving kfuncs a flag that says, "Here's a pointer to a referenced kptr in a map, do whatever you need to do."

With all that said -- so as to allow us to consolidate the kfunc API, and simplify the verifier, this patchset removes the KF_KPTR_GET kfunc flag.

---

This is v2 of this patchset.

v1: https://lore.kernel.org/all/20230415103231.236063-1-void@manifault.com/

Changelog:
----------

v1 -> v2:
- Fix KF_RU -> KF_RCU typo in commit summary for patch 2/3, and in cover letter (Alexei)
- In order to reduce churn, don't shift all KF_* flags down by 1. We'll just fill the now-empty slot the next time we add a flag (Alexei)

====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
David Vernet authored
A prior patch removed KF_KPTR_GET from the kernel. Now that it's no longer accessible to kfunc authors, this patch removes it from the BPF kfunc documentation.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230416084928.326135-4-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
David Vernet authored
We've managed to improve the UX for kptrs significantly over the last 9 months. All of the existing use cases which previously had KF_KPTR_GET kfuncs (struct bpf_cpumask *, struct task_struct *, and struct cgroup *) have all been updated to be synchronized using RCU. In other words, their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU | KF_ACQUIRE kfuncs, with the pointers themselves also being readable from maps in an RCU read region thanks to the types being RCU safe.

While KF_KPTR_GET was a logical starting point for kptrs, it's become clear that they're not the correct abstraction. KF_KPTR_GET is a flag that essentially does nothing other than enforcing that the argument to a function is a pointer to a referenced kptr map value. At first glance, that's a useful thing to guarantee to a kfunc. It gives kfuncs the ability to try and acquire a reference on that kptr without requiring the BPF prog to do something like this:

  struct kptr_type *in_map, *new = NULL;

  in_map = bpf_kptr_xchg(&map->value, NULL);
  if (in_map) {
          new = bpf_kptr_type_acquire(in_map);
          in_map = bpf_kptr_xchg(&map->value, in_map);
          if (in_map)
                  bpf_kptr_type_release(in_map);
  }

That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is the only alternative, it's better than nothing. However, the problem with any KF_KPTR_GET kfunc lies in the fact that it always requires some kind of synchronization in order to safely do an opportunistic acquire of the kptr in the map. This is because a BPF program running on another CPU could do a bpf_kptr_xchg() on that map value, and free the kptr after it's been read by the KF_KPTR_GET kfunc. For example, the now-removed bpf_task_kptr_get() kfunc did the following:

  struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
  {
          struct task_struct *p;

          rcu_read_lock();
          p = READ_ONCE(*pp);

          /* If p is non-NULL, it could still be freed by another CPU,
           * so we have to do an opportunistic refcount_inc_not_zero()
           * and return NULL if the task will be freed after the
           * current RCU read region.
           */
          if (p && !refcount_inc_not_zero(&p->rcu_users))
                  p = NULL;
          rcu_read_unlock();

          return p;
  }

In other words, the kfunc uses RCU to ensure that the task remains valid after it's been peeked from the map. However, this is completely redundant with just defining a KF_RCU kfunc that itself does a refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now does.

So, the question of whether KF_KPTR_GET is useful is actually, "Are there any synchronization mechanisms / safety flags that are required by certain kptrs, but which are not provided by the verifier to kfuncs?" The answer to that question today is "No", because every kptr we currently care about is RCU protected.

Even if the answer ever became "yes", the proper way to support that referenced kptr type would be to add support for whatever synchronization mechanism it requires in the verifier, rather than giving kfuncs a flag that says, "Here's a pointer to a referenced kptr in a map, do whatever you need to do."

With all that said -- so as to allow us to consolidate the kfunc API, and simplify the verifier a bit, this patch removes KF_KPTR_GET, and all relevant logic from the verifier.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230416084928.326135-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
David Vernet authored
We've managed to improve the UX for kptrs significantly over the last 9 months. All of the prior main use cases, struct bpf_cpumask *, struct task_struct *, and struct cgroup *, have all been updated to be synchronized mainly using RCU. In other words, their KF_ACQUIRE kfunc calls are all KF_RCU, and the pointers themselves are MEM_RCU and can be accessed in an RCU read region in BPF.

In a follow-on change, we'll be removing the KF_KPTR_GET kfunc flag. This patch prepares for that by removing the bpf_kfunc_call_test_kptr_get() kfunc, and all associated selftests.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230416084928.326135-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Dave Marchevsky says:

====================

This series adds support for refcounted local kptrs to the verifier. A local kptr is 'refcounted' if its type contains a struct bpf_refcount field:

  struct refcounted_node {
          long data;
          struct bpf_list_node ll;
          struct bpf_refcount ref;
  };

bpf_refcount is used to implement shared ownership for local kptrs.

Motivating usecase
==================

If a struct has two collection node fields, e.g.:

  struct node {
          long key;
          long val;
          struct bpf_rb_node rb;
          struct bpf_list_node ll;
  };

It's not currently possible to add a node to both the list and rbtree:

  long bpf_prog(void *ctx)
  {
          struct node *n = bpf_obj_new(typeof(*n));
          if (!n) { /* ... */ }

          bpf_spin_lock(&lock);
          bpf_list_push_back(&head, &n->ll);
          bpf_rbtree_add(&root, &n->rb, less); /* Assume a reasonable less() */
          bpf_spin_unlock(&lock);
  }

The above program will fail verification due to current owning / non-owning ref logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be passed to bpf_rbtree_add. The only way to get an owning reference for the node that was added is to bpf_list_pop_{front,back} it.

More generally, verifier ownership semantics expect that a node has one owner (program, collection, or stashed in map) with exclusive ownership of the node's lifetime. The owner frees the node's underlying memory when it itself goes away.

Without a shared ownership concept it's impossible to express many real-world usecases such that they pass verification.

Semantic Changes
================

Before this series, the verifier could make this statement: "whoever has the owning reference has exclusive ownership of the referent's lifetime". As demonstrated in the previous section, this implies that a BPF program can't have an owning reference to some node if that node is in a collection. If such a state were possible, the node would have multiple owners, each thinking they have exclusive ownership. In order to support shared ownership it's necessary to modify the exclusive ownership semantic.

After this series' changes, an owning reference has ownership of the referent's lifetime, but it's not necessarily exclusive. The referent's underlying memory is guaranteed to be valid (i.e. not free'd) until the reference is dropped or used for collection insert.

This change doesn't affect UX of owning or non-owning references much:

* insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require an owning reference arg, as ownership still must be passed to the collection in a shared-ownership world.
* non-owning references still refer to valid memory without claiming any ownership.

One important conclusion that followed from the "exclusive ownership" statement is no longer valid, though. In the exclusive-ownership world, if a BPF prog has an owning reference to a node, the verifier can conclude that no collection has ownership of it. This conclusion was used to avoid runtime checking in the implementations of insert and remove operations ("has the node already been {inserted, removed}?").

In a shared-ownership world the aforementioned conclusion is no longer valid, which necessitates doing runtime checking in insert and remove operation kfuncs, and those functions possibly failing to insert or remove anything.

Luckily the verifier changes necessary to go from exclusive to shared ownership were fairly minimal. Patches in this series which do change verifier semantics generally have some summary dedicated to explaining why certain usecases Just Work for shared ownership without verifier changes.

Implementation
==============

The changes in this series can be categorized as follows:

* struct bpf_refcount opaque field + plumbing
* support for refcounted kptrs in bpf_obj_new and bpf_obj_drop
* bpf_refcount_acquire kfunc
  * enables shared ownership by bumping refcount + acquiring owning ref
* support for possibly-failing collection insertion and removal
  * insertion changes are more complex

If a patch's changes have some nuance to their effect - or lack of effect - on verifier behavior, the patch summary talks about it at length.

Patch contents:
* Patch 1 removes btf_field_offs struct
* Patch 2 adds struct bpf_refcount and associated plumbing
* Patch 3 modifies semantics of bpf_obj_drop and bpf_obj_new to handle refcounted kptrs
* Patch 4 adds bpf_refcount_acquire
* Patches 5-7 add support for possibly-failing collection insert and remove
* Patch 8 centralizes constructor-like functionality for local kptr types
* Patch 9 adds tests for new functionality

base-commit: 4a1e885c

Changelog:

v1 -> v2: lore.kernel.org/bpf/20230410190753.2012798-1-davemarchevsky@fb.com

Patch #s used below refer to the patch's position in v1 unless otherwise specified.

* General
  * Rebase onto latest bpf-next (base-commit updated above)
* Patch 4 - "bpf: Add bpf_refcount_acquire kfunc"
  * Fix typo in summary (Alexei)
* Patch 7 - "Migrate bpf_rbtree_remove to possibly fail"
  * Modify a paragraph in patch summary to more clearly state that only bpf_rbtree_remove's non-owning ref clobbering behavior is changed by the patch (Alexei)
  * refcount_off == -1 -> refcount_off < 0 in "node type w/ both list and rb_node fields" check, since any negative value means "no bpf_refcount field found", and furthermore refcount_off is never explicitly set to -1, but rather -EINVAL. (Alexei)
  * Instead of just changing "btf: list_node and rb_node in same struct" test expectation to pass instead of fail, do some refactoring to test both "list_node, rb_node, and bpf_refcount" (success) and "list_node, rb_node, _no_ bpf_refcount" (failure) cases. This ensures that the logic change in the previous bullet point is correct.
    * v1's "btf: list_node and rb_node in same struct" test changes didn't add bpf_refcount, so the fact that btf load succeeded w/ list and rb_nodes but no bpf_refcount field is further proof that this logic was incorrect in v1.
* Patch 8 - "bpf: Centralize btf_field-specific initialization logic"
  * Instead of doing __init_field_infer_size in kfuncs when taking bpf_list_head type input which might've been 0-initialized in map, go back to simple oneliner initialization. Add short comment explaining why this is necessary. (Alexei)
* Patch 9 - "selftests/bpf: Add refcounted_kptr tests"
  * Don't __always_inline helper fns in progs/refcounted_kptr.c (Alexei)

====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
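For reference, a sketch of how the motivating example above can look once bpf_refcount_acquire() is available, reusing the struct node, lock, head, root and less() from that example; error handling is abbreviated and the exact kfunc semantics are those described by this series.

  long bpf_prog(void *ctx)
  {
          struct node *n = bpf_obj_new(typeof(*n));
          struct node *m;

          if (!n)
                  return 0;

          /* take a second owning reference before ownership of n moves to the list */
          m = bpf_refcount_acquire(n);

          bpf_spin_lock(&lock);
          bpf_list_push_back(&head, &n->ll);   /* the list now co-owns the node */
          bpf_rbtree_add(&root, &m->rb, less); /* the rbtree gets its own owning ref */
          bpf_spin_unlock(&lock);
          return 0;
  }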
-