Commits · 4aadd2920b81b3d7e5c8ac63c7d5d673f3c8aaeb · Kirill Smelkov / linux

26 May, 2023 2 commits

libbpf: Ensure FD >= 3 during bpf_map__reuse_fd() · 4aadd292

Andrii Nakryiko authored May 25, 2023

Improve bpf_map__reuse_fd() logic and ensure that dup'ed map FD is
"good" (>= 3) and has O_CLOEXEC flags. Use fcntl(F_DUPFD_CLOEXEC) for
that, similarly to ensure_good_fd() helper we already use in low-level
APIs that work with bpf() syscall.
Suggested-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230525221311.2136408-2-andrii@kernel.org

4aadd292

libbpf: Ensure libbpf always opens files with O_CLOEXEC · 59842c54

Andrii Nakryiko authored May 25, 2023

Make sure that libbpf code always gets FD with O_CLOEXEC flag set,
regardless if file is open through open() or fopen(). For the latter
this means to add "e" to mode string, which is supported since pretty
ancient glibc v2.7.

Also drop the outdated TODO comment in usdt.c, which was already completed.
Suggested-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230525221311.2136408-1-andrii@kernel.org

59842c54

25 May, 2023 3 commits

selftests/bpf: Check whether to run selftest · 321a64b3

Daniel Müller authored May 25, 2023

The sockopt test invokes test__start_subtest and then unconditionally
asserts the success. That means that even if deny-listed, any test will
still run and potentially fail.
Evaluate the return value of test__start_subtest() to achieve the
desired behavior, as other tests do.
Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230525232248.640465-1-deso@posteo.net

321a64b3

libbpf: Change var type in datasec resize func · 4c857a71

JP Kobryn authored May 24, 2023

This changes a local variable type that stores a new array id to match
the return type of btf__add_array().
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230525001323.8554-1-inwardvessel@gmail.com

4c857a71

bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command · c4c84f6f

Andrii Nakryiko authored May 24, 2023

Seems like that extra bpf_capable() check in BPF_MAP_FREEZE handler was
unintentionally left when we switched to a model that all BPF map
operations should be allowed regardless of CAP_BPF (or any other
capabilities), as long as process got BPF map FD somehow.

This patch replaces bpf_capable() check in BPF_MAP_FREEZE handler with
writeable access check, given conceptually freezing the map is modifying
it: map becomes unmodifiable for subsequent updates.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230524225421.1587859-2-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

c4c84f6f

24 May, 2023 3 commits

Merge branch 'libbpf: capability for resizing datasec maps' · fcf1fa29

Andrii Nakryiko authored May 24, 2023

JP Kobryn says:

====================
Due to the way the datasec maps like bss, data, rodata are memory
mapped, they cannot be resized with bpf_map__set_value_size() like
non-datasec maps can. This series offers a way to allow the resizing of
datasec maps, by having the mapped regions resized as needed and also
adjusting associated BTF info if possible.

The thought behind this is to allow for use cases where a given datasec
needs to scale to for example the number of CPU's present. A bpf program
can have a global array in a data section with an initial length and
before loading the bpf program, the array length could be extended to
match the CPU count. The selftests included in this series perform this
scaling to an arbitrary value to demonstrate how it can work.
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

fcf1fa29

libbpf: Selftests for resizing datasec maps · 08b08956

JP Kobryn authored May 23, 2023

This patch adds test coverage for resizing datasec maps. The first two
subtests resize the bss and custom data sections. In both cases, an
initial array (of length one) has its element set to one. After resizing
the rest of the array is filled with ones as well. A BPF program is then
run to sum the respective arrays and back on the userspace side the sum
is checked to be equal to the number of elements.
The third subtest attempts to perform resizing under conditions that
will result in either the resize failing or the BTF info being cleared.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230524004537.18614-3-inwardvessel@gmail.com

08b08956

libbpf: Add capability for resizing datasec maps · 9d0a2331

JP Kobryn authored May 23, 2023

This patch updates bpf_map__set_value_size() so that if the given map is
memory mapped, it will attempt to resize the mapped region. Initial
contents of the mapped region are preserved. BTF is not required, but
after the mapping is resized an attempt is made to adjust the associated
BTF information if the following criteria is met:
 - BTF info is present
 - the map is a datasec
 - the final variable in the datasec is an array

... the resulting BTF info will be updated so that the final array
variable is associated with a new BTF array type sized to cover the
requested size.

Note that the initial resizing of the memory mapped region can succeed
while the subsequent BTF adjustment can fail. In this case, BTF info is
dropped from the map by clearing the key and value type.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230524004537.18614-2-inwardvessel@gmail.com

9d0a2331

23 May, 2023 7 commits

selftests/bpf: Add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests · 3b22f98e

Andrii Nakryiko authored May 23, 2023

Add a selftest demonstrating using detach-mounted BPF FS using new mount
APIs, and pinning and getting BPF map using such mount. This
demonstrates how something like container manager could setup BPF FS,
pin and adjust all the necessary objects in it, all before exposing BPF
FS to a particular mount namespace.

Also add a few subtests validating all meaningful combinations of
path_fd and pathname. We use mounted /sys/fs/bpf location for these.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230523170013.728457-5-andrii@kernel.org

3b22f98e

libbpf: Add opts-based bpf_obj_pin() API and add support for path_fd · f1674dc7

Andrii Nakryiko authored May 15, 2023

Add path_fd support for bpf_obj_pin() and bpf_obj_get() operations
(through their opts-based variants). This allows to take advantage of
new kernel-side support for O_PATH-based pin/get location specification.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org

f1674dc7

bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands · cb8edce2

Andrii Nakryiko authored May 15, 2023

Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.

One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.

This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.

This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org

cb8edce2

libbpf: Start v1.3 development cycle · 2b001b94

Andrii Nakryiko authored May 23, 2023

Bump libbpf.map to v1.3.0 to start a new libbpf version cycle.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230523170013.728457-3-andrii@kernel.org

2b001b94

bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM · e7d85427

Andrii Nakryiko authored May 22, 2023

Do a sanity check whether provided file-to-be-pinned is actually a BPF
object (prog, map, btf) before calling security_path_mknod LSM hook. If
it's not, LSM hook doesn't have to be triggered, as the operation has no
chance of succeeding anyways.
Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230522232917.2454595-2-andrii@kernel.org

e7d85427

bpftool: Specify XDP Hints ifname when loading program · f46392ee

Larysa Zaremba authored May 17, 2023

Add ability to specify a network interface used to resolve XDP hints
kfuncs when loading program through bpftool.

Usage:

  bpftool prog load [...] xdpmeta_dev <ifname>

Writing just 'dev <ifname>' instead of 'xdpmeta_dev' is a very probable
mistake that results in not very descriptive errors,
so 'bpftool prog load [...] dev <ifname>' syntax becomes deprecated,
followed by 'bpftool map create [...] dev <ifname>' for consistency.

Now, to offload program, execute:

  bpftool prog load [...] offload_dev <ifname>

To offload map:

  bpftool map create [...] offload_dev <ifname>

'dev <ifname>' still performs offloading in the commands above, but now
triggers a warning and is excluded from bash completion.

'xdpmeta_dev' and 'offload_dev' are mutually exclusive options, because
'xdpmeta_dev' basically makes a program device-bound without loading it
onto the said device. For now, offloaded programs cannot use XDP hints [0],
but if this changes, using 'offload_dev <ifname>' should cover this case.

  [0] https://lore.kernel.org/bpf/a5a636cc-5b03-686f-4be0-000383b05cfc@linux.devSigned-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20230517160103.1088185-1-larysa.zaremba@intel.com

f46392ee

selftests/bpf: Add xdp_feature selftest for bond device · 6cc385d2

Lorenzo Bianconi authored May 17, 2023

Introduce selftests to check xdp_feature support for bond driver.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jussi Maki <joamaki@gmail.com>
Link: https://lore.kernel.org/bpf/64cb8f20e6491f5b971f8d3129335093c359aad7.1684329998.git.lorenzo@kernel.org

6cc385d2

20 May, 2023 10 commits

Merge branch 'bpf: Add socket destroy capability' · 18f55887

Martin KaFai Lau authored May 19, 2023

Aditi Ghag says:

====================

This patch set adds the capability to destroy sockets in BPF. We plan to
use the capability in Cilium to force client sockets to reconnect when
their remote load-balancing backends are deleted. The other use case is
on-the-fly policy enforcement where existing socket connections
prevented by policies need to be terminated.

The use cases, and more details around
the selected approach were presented at LPC 2022 -
https://lpc.events/event/16/contributions/1358/.
RFC discussion -
https://lore.kernel.org/netdev/CABG=zsBEh-P4NXk23eBJw7eajB5YJeRS7oPXnTAzs=yob4EMoQ@mail.gmail.com/T/#u.
v8 patch series -
https://lore.kernel.org/bpf/20230517175359.527917-1-aditi.ghag@isovalent.com/

v9 highlights:
Address review comments:
Martin:
- Rearranged the kfunc filter patch, and added the missing break
  statement.
- Squashed the extended selftest/bpf patch.
Yonghong:
- Revised commit message for patch 1.

(Below notes are same as v8 patch series that are still relevant. Refer to
earlier patch series versions for other notes.)
- I hit a snag while writing the kfunc where verifier complained about the
  `sock_common` type passed from TCP iterator. With kfuncs, there don't
  seem to be any options available to pass BTF type hints to the verifier
  (equivalent of `ARG_PTR_TO_BTF_ID_SOCK_COMMON`, as was the case with the
  helper).  As a result, I changed the argument type of the sock_destory
  kfunc to `sock_common`.
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

18f55887

selftests/bpf: Test bpf_sock_destroy · 1a8bc229

Aditi Ghag authored May 19, 2023

The test cases for destroying sockets mirror the intended usages of the
bpf_sock_destroy kfunc using iterators.

The destroy helpers set `ECONNABORTED` error code that we can validate
in the test code with client sockets. But UDP sockets have an overriding
error code from `disconnect()` called during abort, so the error code
validation is only done for TCP sockets.

The failure test cases validate that the `bpf_sock_destroy` kfunc is not
allowed from program attach types other than BPF trace iterator, and
such programs fail to load.
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-10-aditi.ghag@isovalent.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

1a8bc229

selftests/bpf: Add helper to get port using getsockname · 176ba657

Aditi Ghag authored May 19, 2023

The helper will be used to programmatically retrieve
and pass ports in userspace and kernel selftest programs.
Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-9-aditi.ghag@isovalent.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

176ba657

bpf: Add bpf_sock_destroy kfunc · 4ddbcb88

Aditi Ghag authored May 19, 2023

The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium
load-balancing to terminate client sockets that continue to connect to
deleted backends. The other use case is on-the-fly policy enforcement
where existing socket connections prevented by policies need to be
forcefully terminated. The kfunc also allows terminating sockets that may
or may not be actively sending traffic.

The kfunc can currently be called only from BPF TCP and UDP iterators
where users can filter, and terminate selected sockets. More
specifically, it can only be called from BPF contexts that ensure
socket locking in order to allow synchronous execution of protocol
specific `diag_destroy` handlers. The previous commit that batches UDP
sockets during iteration facilitated a synchronous invocation of the UDP
destroy callback from BPF context by skipping socket locks in
`udp_abort`. TCP iterator already supported batching of sockets being
iterated. To that end, `tracing_iter_filter` callback filter is added so
that verifier can restrict the kfunc to programs with `BPF_TRACE_ITER`
attach type, and reject other programs.

The kfunc takes `sock_common` type argument, even though it expects, and
casts them to a `sock` pointer. This enables the verifier to allow the
sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
`sock` structs. Furthermore, as `sock_common` only has a subset of
certain fields of `sock`, casting pointer to the latter type might not
always be safe for certain sockets like request sockets, but these have a
special handling in the diag_destroy handlers.

Additionally, the kfunc is defined with `KF_TRUSTED_ARGS` flag to avoid the
cases where a `PTR_TO_BTF_ID` sk is obtained by following another pointer.
eg. getting a sk pointer (may be even NULL) by following another sk
pointer. The pointer socket argument passed in TCP and UDP iterators is
tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info. The TRUSTED arg changes
are contributed by Martin KaFai Lau <martin.lau@kernel.org>.
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-8-aditi.ghag@isovalent.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

4ddbcb88

bpf: Add kfunc filter function to 'struct btf_kfunc_id_set' · e924e80e

Aditi Ghag authored May 19, 2023

This commit adds the ability to filter kfuncs to certain BPF program
types. This is required to limit bpf_sock_destroy kfunc implemented in
follow-up commits to programs with attach type 'BPF_TRACE_ITER'.

The commit adds a callback filter to 'struct btf_kfunc_id_set'. The
filter has access to the `bpf_prog` construct including its properties
such as `expected_attached_type`.
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-7-aditi.ghag@isovalent.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

e924e80e

bpf: udp: Implement batching for sockets iterator · c96dac8d

Aditi Ghag authored May 19, 2023

Batch UDP sockets from BPF iterator that allows for overlapping locking
semantics in BPF/kernel helpers executed in BPF programs. This facilitates
BPF socket destroy kfunc (introduced by follow-up patches) to execute from
BPF iterator programs.

Previously, BPF iterators acquired the sock lock and sockets hash table
bucket lock while executing BPF programs. This prevented BPF helpers that
again acquire these locks to be executed from BPF iterators. With the
batching approach, we acquire a bucket lock, batch all the bucket sockets,
and then release the bucket lock. This enables BPF or kernel helpers to
skip sock locking when invoked in the supported BPF contexts.

The batching logic is similar to the logic implemented in TCP iterator:
https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-6-aditi.ghag@isovalent.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

c96dac8d

udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state · e4fe1bf1

Aditi Ghag authored May 19, 2023

This is a preparatory commit to remove the field. The field was
previously shared between proc fs and BPF UDP socket iterators. As the
follow-up commits will decouple the implementation for the iterators,
remove the field. As for BPF socket iterator, filtering of sockets is
exepected to be done in BPF programs.
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-5-aditi.ghag@isovalent.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

e4fe1bf1

bpf: udp: Encapsulate logic to get udp table · 7625d2e9

Aditi Ghag authored May 19, 2023

This is a preparatory commit that encapsulates the logic
to get udp table in iterator inside udp_get_table_afinfo, and
renames the function to `udp_get_table_seq` accordingly.
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-4-aditi.ghag@isovalent.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

7625d2e9

udp: seq_file: Helper function to match socket attributes · f44b1c51

Aditi Ghag authored May 19, 2023

This is a preparatory commit to refactor code that matches socket
attributes in iterators to a helper function, and use it in the
proc fs iterator.
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-3-aditi.ghag@isovalent.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

f44b1c51

bpf: tcp: Avoid taking fast sock lock in iterator · 9378096e

Aditi Ghag authored May 19, 2023

This is a preparatory commit to replace `lock_sock_fast` with
`lock_sock`,and facilitate BPF programs executed from the TCP sockets
iterator to be able to destroy TCP sockets using the bpf_sock_destroy
kfunc (implemented in follow-up commits).

Previously, BPF TCP iterator was acquiring the sock lock with BH
disabled. This led to scenarios where the sockets hash table bucket lock
can be acquired with BH enabled in some path versus disabled in other.
In such situation, kernel issued a warning since it thinks that in the
BH enabled path the same bucket lock *might* be acquired again in the
softirq context (BH disabled), which will lead to a potential dead lock.
Since bpf_sock_destroy also happens in a process context, the potential
deadlock warning is likely a false alarm.

Here is a snippet of annotated stack trace that motivated this change:

```

Possible interrupt unsafe locking scenario:

      CPU0                    CPU1
      ----                    ----
 lock(&h->lhash2[i].lock);
                              local_bh_disable();
                              lock(&h->lhash2[i].lock);
kernel imagined possible scenario:
  local_bh_disable();  /* Possible softirq */
  lock(&h->lhash2[i].lock);
*** Potential Deadlock ***

process context:

lock_acquire+0xcd/0x330
_raw_spin_lock+0x33/0x40
------> Acquire (bucket) lhash2.lock with BH enabled
__inet_hash+0x4b/0x210
inet_csk_listen_start+0xe6/0x100
inet_listen+0x95/0x1d0
__sys_listen+0x69/0xb0
__x64_sys_listen+0x14/0x20
do_syscall_64+0x3c/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

bpf_sock_destroy run from iterator:

lock_acquire+0xcd/0x330
_raw_spin_lock+0x33/0x40
------> Acquire (bucket) lhash2.lock with BH disabled
inet_unhash+0x9a/0x110
tcp_set_state+0x6a/0x210
tcp_abort+0x10d/0x200
bpf_prog_6793c5ca50c43c0d_iter_tcp6_server+0xa4/0xa9
bpf_iter_run_prog+0x1ff/0x340
------> lock_sock_fast that acquires sock lock with BH disabled
bpf_iter_tcp_seq_show+0xca/0x190
bpf_seq_read+0x177/0x450

```

Also, Yonghong reported a deadlock for non-listening TCP sockets that
this change resolves. Previously, `lock_sock_fast` held the sock spin
lock with BH which was again being acquired in `tcp_abort`:

```
watchdog: BUG: soft lockup - CPU#0 stuck for 86s! [test_progs:2331]
RIP: 0010:queued_spin_lock_slowpath+0xd8/0x500
Call Trace:
 <TASK>
 _raw_spin_lock+0x84/0x90
 tcp_abort+0x13c/0x1f0
 bpf_prog_88539c5453a9dd47_iter_tcp6_client+0x82/0x89
 bpf_iter_run_prog+0x1aa/0x2c0
 ? preempt_count_sub+0x1c/0xd0
 ? from_kuid_munged+0x1c8/0x210
 bpf_iter_tcp_seq_show+0x14e/0x1b0
 bpf_seq_read+0x36c/0x6a0

bpf_iter_tcp_seq_show
   lock_sock_fast
     __lock_sock_fast
       spin_lock_bh(&sk->sk_lock.slock);
	/* * Fast path return with bottom halves disabled and * sock::sk_lock.slock held.* */

 ...
 tcp_abort
   local_bh_disable();
   spin_lock(&((sk)->sk_lock.slock)); // from bh_lock_sock(sk)

```

With the switch to `lock_sock`, it calls `spin_unlock_bh` before returning:

```
lock_sock
    lock_sock_nested
       spin_lock_bh(&sk->sk_lock.slock);
       :
       spin_unlock_bh(&sk->sk_lock.slock);
```
Acked-by: Yonghong Song <yhs@meta.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20230519225157.760788-2-aditi.ghag@isovalent.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

9378096e

19 May, 2023 3 commits

Merge branch 'bpf: Show target_{obj,btf}_id for tracing link' · 9343184c

Alexei Starovoitov authored May 19, 2023

Yafang Shao says:

====================
The target_btf_id can help us understand which kernel function is
linked by a tracing prog. The target_btf_id and target_obj_id have
already been exposed to userspace, so we just need to show them.

For some other link types like perf_event and kprobe_multi, it is not
easy to find which functions are attached either. We may support
->fill_link_info for them in the future.

v1->v2:
- Skip showing them in the plain output for the old kernels. (Quentin)
- Coding improvement. (Andrii)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

9343184c

bpftool: Show target_{obj,btf}_id in tracing link info · d7e45eb4

Yafang Shao authored May 17, 2023

The target_btf_id can help us understand which kernel function is
linked by a tracing prog. The target_btf_id and target_obj_id have
already been exposed to userspace, so we just need to show them.

The result as follows,

$ tools/bpf/bpftool/bpftool link show
2: tracing  prog 13
        prog_type tracing  attach_type trace_fentry
        target_obj_id 1  target_btf_id 13964
        pids trace(10673)

$ tools/bpf/bpftool/bpftool link show -j
[{"id":2,"type":"tracing","prog_id":13,"prog_type":"tracing","attach_type":"trace_fentry","target_obj_id":1,"target_btf_id":13964,"pids":[{"pid":10673,"comm":"trace"}]}]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230517103126.68372-3-laoar.shao@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

d7e45eb4

bpf: Show target_{obj,btf}_id in tracing link fdinfo · e859e429

Yafang Shao authored May 17, 2023

The target_btf_id can help us understand which kernel function is
linked by a tracing prog. The target_btf_id and target_obj_id have
already been exposed to userspace, so we just need to show them.

The result as follows,

$ cat /proc/10673/fdinfo/10
pos:    0
flags:  02000000
mnt_id: 15
ino:    2094
link_type:      tracing
link_id:        2
prog_tag:       a04f5eef06a7f555
prog_id:        13
attach_type:    24
target_obj_id:  1
target_btf_id:  13964
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230517103126.68372-2-laoar.shao@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

e859e429

17 May, 2023 12 commits

selftests/bpf: Make bpf_dynptr_is_rdonly() prototyype consistent with kernel · effcf624

Yonghong Song authored May 16, 2023

Currently kernel kfunc bpf_dynptr_is_rdonly() has prototype ...

  __bpf_kfunc bool bpf_dynptr_is_rdonly(struct bpf_dynptr_kern *ptr)

... while selftests bpf_kfuncs.h has:

  extern int bpf_dynptr_is_rdonly(const struct bpf_dynptr *ptr) __ksym;

Such a mismatch might cause problems although currently it is okay in
selftests. Fix it to prevent future potential surprise.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230517040409.4024618-1-yhs@fb.com

effcf624

selftests/bpf: Fix dynptr/test_dynptr_is_null · 12852f8e

Yonghong Song authored May 16, 2023

With latest llvm17, dynptr/test_dynptr_is_null subtest failed in my testing
VM. The failure log looks like below:

  All error logs:
  tester_init:PASS:tester_log_buf 0 nsec
  process_subtest:PASS:obj_open_mem 0 nsec
  process_subtest:PASS:Can't alloc specs array 0 nsec
  verify_success:PASS:dynptr_success__open 0 nsec
  verify_success:PASS:bpf_object__find_program_by_name 0 nsec
  verify_success:PASS:dynptr_success__load 0 nsec
  verify_success:PASS:bpf_program__attach 0 nsec
  verify_success:FAIL:err unexpected err: actual 4 != expected 0
  #65/9    dynptr/test_dynptr_is_null:FAIL

The error happens for bpf prog test_dynptr_is_null in dynptr_success.c:

        if (bpf_dynptr_is_null(&ptr2)) {
                err = 4;
                goto exit;
        }

The bpf_dynptr_is_null(&ptr) unexpectedly returned a non-zero value and
the control went to the error path. Digging further, I found the root cause
is due to function signature difference between kernel and user space.

In kernel, we have ...

  __bpf_kfunc bool bpf_dynptr_is_null(struct bpf_dynptr_kern *ptr)

... while in bpf_kfuncs.h we have:

  extern int bpf_dynptr_is_null(const struct bpf_dynptr *ptr) __ksym;

The kernel bpf_dynptr_is_null disasm code:

  ffffffff812f1a90 <bpf_dynptr_is_null>:
  ffffffff812f1a90: f3 0f 1e fa           endbr64
  ffffffff812f1a94: 0f 1f 44 00 00        nopl    (%rax,%rax)
  ffffffff812f1a99: 53                    pushq   %rbx
  ffffffff812f1a9a: 48 89 fb              movq    %rdi, %rbx
  ffffffff812f1a9d: e8 ae 29 17 00        callq   0xffffffff81464450 <__asan_load8_noabort>
  ffffffff812f1aa2: 48 83 3b 00           cmpq    $0x0, (%rbx)
  ffffffff812f1aa6: 0f 94 c0              sete    %al
  ffffffff812f1aa9: 5b                    popq    %rbx
  ffffffff812f1aaa: c3                    retq

Note that only 1-byte register %al is set and the other 7-bytes are not
touched. In bpf program, the asm code for the above bpf_dynptr_is_null(&ptr2):

       266:       85 10 00 00 ff ff ff ff call -0x1
       267:       b4 01 00 00 04 00 00 00 w1 = 0x4
       268:       16 00 03 00 00 00 00 00 if w0 == 0x0 goto +0x3 <LBB9_8>

Basically, 4-byte subregister is tested. This might cause error as the value
other than the lowest byte might not be 0.

This patch fixed the issue by using the identical func prototype across kernel
and selftest user space. The fixed bpf asm code:

       267:       85 10 00 00 ff ff ff ff call -0x1
       268:       54 00 00 00 01 00 00 00 w0 &= 0x1
       269:       b4 01 00 00 04 00 00 00 w1 = 0x4
       270:       16 00 03 00 00 00 00 00 if w0 == 0x0 goto +0x3 <LBB9_8>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230517040404.4023912-1-yhs@fb.com

12852f8e

bpf, docs: Shift operations are defined to use a mask · 8819495a

Dave Thaler authored May 09, 2023

Update the documentation regarding shift operations to explain the
use of a mask, since otherwise shifting by a value out of range
(like negative) is undefined.
Signed-off-by: Dave Thaler <dthaler@microsoft.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230509180845.1236-1-dthaler1968@googlemail.com

8819495a

bpftool: Support bpffs mountpoint as pin path for prog loadall · 2a36c26f

Pengcheng Yang authored May 06, 2023

Currently, when using prog loadall and the pin path is a bpffs mountpoint,
bpffs will be repeatedly mounted to the parent directory of the bpffs
mountpoint path. For example, a `bpftool prog loadall test.o /sys/fs/bpf`
will trigger this.
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/1683342439-3677-1-git-send-email-yangpc@wangsu.com

2a36c26f

selftests/bpf: Do not use sign-file as testcase · f04a32b2

Alexey Gladkov authored May 17, 2023

The sign-file utility (from scripts/) is used in prog_tests/verify_pkcs7_sig.c,
but the utility should not be called as a test. Executing this utility produces
the following error:

selftests: /linux/tools/testing/selftests/bpf: urandom_read
ok 16 selftests: /linux/tools/testing/selftests/bpf: urandom_read

selftests: /linux/tools/testing/selftests/bpf: sign-file
not ok 17 selftests: /linux/tools/testing/selftests/bpf: sign-file # exit=2

Also, urandom_read is mistakenly used as a test. It does not lead to an error,
but should be moved over to TEST_GEN_FILES as well. The empty TEST_CUSTOM_PROGS
can then be removed.

Fixes: fc975906 ("selftests/bpf: Add test for bpf_verify_pkcs7_signature() kfunc")
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/ZEuWFk3QyML9y5QQ@example.org
Link: https://lore.kernel.org/bpf/88e3ab23029d726a2703adcf6af8356f7a2d3483.1684316821.git.legion@kernel.org

f04a32b2

bpf: drop unnecessary user-triggerable WARN_ONCE in verifierl log · cff36398

Andrii Nakryiko authored May 16, 2023

It's trivial for user to trigger "verifier log line truncated" warning,
as verifier has a fixed-sized buffer of 1024 bytes (as of now), and there are at
least two pieces of user-provided information that can be output through
this buffer, and both can be arbitrarily sized by user:
  - BTF names;
  - BTF.ext source code lines strings.

Verifier log buffer should be properly sized for typical verifier state
output. But it's sort-of expected that this buffer won't be long enough
in some circumstances. So let's drop the check. In any case code will
work correctly, at worst truncating a part of a single line output.

Reported-by: syzbot+8b2a08dfbd25fd933d75@syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230516180409.3549088-1-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

cff36398

Merge branch 'seltests/xsk: prepare for AF_XDP multi-buffer testing' · 34e78bab

Alexei Starovoitov authored May 16, 2023

Magnus Karlsson says:

====================
Prepare the AF_XDP selftests test framework code for the upcoming
multi-buffer support in AF_XDP. This so that the multi-buffer patch
set does not become way too large. In that upcoming patch set, we are
only including the multi-buffer tests together with any framework
code that depends on the new options bit introduced in the AF_XDP
multi-buffer implementation itself.

Currently, the test framework is based on the premise that a packet
consists of a single fragment and thus occupies a single buffer and a
single descriptor. Multi-buffer breaks this assumption, as that is the
whole purpose of it. Now, a packet can consist of multiple buffers and
therefore consume multiple descriptors.

The patch set starts with some clean-ups and simplifications followed
by patches that make sure that the current code works even when a
packet occupies multiple buffers. The actual code for sending and
receiving multi-buffer packets will be included in the AF_XDP
multi-buffer patch set as it depends on a new bit being used in the
options field of the descriptor.

Patch set anatomy:
1: The XDP program was unnecessarily changed many times. Fixes this.

2: There is no reason to generate a full UDP/IPv4 packet as it is
   never used. Simplify the code by just generating a valid Ethernet
   frame.

3: Introduce a more complicated payload pattern that can detect
   fragments out of bounds in a multi-buffer packet and other errors
   found in single-fragment packets.

4: As a convenience, dump the content of the faulty packet at error.

5: To simplify the code, make the usage of the packet stream for Tx
   and Rx more similar.

6: Store the offset of the packet in the buffer in the struct pkt
   definition instead of the address in the umem itself and introduce
   a simple buffer allocator. The address only made sense when all
   packets consumed a single buffer. Now, we do not know beforehand
   how many buffers a packet will consume, so we instead just allocate
   a buffer from the allocator and specify the offset within that
   buffer.

7: Test for huge pages only once instead of before each test that needs it.

8: Populate the fill ring based on how many frags are needed for each
   packet.

9: Change the data generation code so it can generate data for
   multi-buffer packets too.

10: Adjust the packet pacing algorithm so that it can cope with
    multi-buffer packets. The pacing algorithm is present so that Tx
    does not send too many packets/frames to Rx that it starts to drop
    packets. That would ruin the tests.

v1 -> v2:
* Fixed spelling error in patch #6 [Simon]
* Fixed compilation error with llvm in patch #7 [Daniel]

Thanks: Magnus
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

34e78bab

selftests/xsk: adjust packet pacing for multi-buffer support · 7cd6df4f

Magnus Karlsson authored May 16, 2023

Modify the packet pacing algorithm so that it works with multi-buffer
packets. This algorithm makes sure we do not send too many buffers to
the receiving thread so that packets have to be dropped. The previous
algorithm made the assumption that each packet only consumes one
buffer, but that is not true anymore when multi-buffer support gets
added. Instead, we find out what the largest packet size is in the
packet stream and assume that each packet will consume this many
buffers. This is conservative and overly cautious as there might be
smaller packets in the stream that need fewer buffers per packet. But
it keeps the algorithm simple.

Also simplify it by removing the pthread conditional and just test if
there is enough space in the Rx thread before trying to send one more
batch. Also makes the tests run faster.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230516103109.3066-11-magnus.karlsson@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

7cd6df4f

selftests/xsk: generate data for multi-buffer packets · 2f6eae0d

Magnus Karlsson authored May 16, 2023

Add the ability to generate data in the packets that are correct for
multi-buffer packets. The ethernet header should only go into the
first fragment followed by data and the others should only have
data. We also need to modify the pkt_dump function so that it knows
what fragment has an ethernet header so it can print this.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230516103109.3066-10-magnus.karlsson@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

2f6eae0d

selftests/xsk: populate fill ring based on frags needed · 86e41755

Magnus Karlsson authored May 16, 2023

Populate the fill ring based on the number of frags a packet
needs. With multi-buffer support, a packet might require more than a
single fragment/buffer, so the function xsk_populate_fill_ring() needs
to consider how many buffers a packet will consume, and put that many
buffers on the fill ring for each packet it should receive. As we are
still not sending any multi-buffer packets, the function will only
produce one buffer per packet at the moment.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230516103109.3066-9-magnus.karlsson@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

86e41755

selftests/xsx: test for huge pages only once · 041b68f6

Magnus Karlsson authored May 16, 2023

Test for hugepages only once at the beginning of the execution of the
whole test suite, instead of before each test that needs huge
pages. These are the tests that use unaligned mode. As more unaligned
tests will be added, so the current system just does not scale.

With this change, there are now three possible outcomes of a test run:
fail, pass, or skip. To simplify the handling of this, the function
testapp_validate_traffic() now returns this value to the main loop. As
this function is used by nearly all tests, it meant a small change to
most of them.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230516103109.3066-8-magnus.karlsson@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

041b68f6

selftests/xsk: store offset in pkt instead of addr · d9f6d970

Magnus Karlsson authored May 16, 2023

Store the offset in struct pkt instead of the address. This is
important since address is only meaningful in the context of a packet
that is stored in a single umem buffer and thus a single Tx
descriptor. If the packet, in contrast need to be represented by
multiple buffers in the umem, storing the address makes no sense since
the packet will consist of multiple buffers in the umem at various
addresses. This change is in preparation for the upcoming
multi-buffer support in AF_XDP and the corresponding tests.

So instead of indicating the address, we instead indicate the offset
of the packet in the first buffer. The actual address of the buffer is
allocated from the umem with a new function called
umem_alloc_buffer(). This also means we can get rid of the
use_fill_for_addr flag as the addresses fed into the fill ring will
always be the offset from the pkt specification in the packet stream
plus the address of the allocated buffer from the umem. No special
casing needed.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230516103109.3066-7-magnus.karlsson@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

d9f6d970