- 03 Sep, 2022 2 commits
-
-
Martin KaFai Lau authored
This patch changes sk_getsockopt() to take sockptr_t arguments so that it can be used by bpf_getsockopt(SOL_SOCKET) in a later patch. security_socket_getpeersec_stream() is not changed. It stays with the __user ptr (optval.user and optlen.user) to avoid changes to other security hooks. bpf_getsockopt(SOL_SOCKET) also does not support SO_PEERSEC. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002802.2888419-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
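For illustration, a minimal sketch of the sockptr_t pattern this change enables (not the actual patch; the helper name and locals are made up): the same copy path can serve both a __user pointer from the syscall and a kernel pointer from bpf_getsockopt().

    /* illustrative kernel-style fragment, assuming <linux/sockptr.h> */
    static int copy_int_option(sockptr_t optval, sockptr_t optlen, int val)
    {
        int len;

        if (copy_from_sockptr(&len, optlen, sizeof(len)))
            return -EFAULT;
        if (len < (int)sizeof(val))
            return -EINVAL;
        /* works whether the caller built the sockptr_t with
         * USER_SOCKPTR() (syscall path) or KERNEL_SOCKPTR() (bpf path)
         */
        return copy_to_sockptr(optval, &val, sizeof(val)) ? -EFAULT : 0;
    }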
-
Martin KaFai Lau authored
A later patch refactors bpf_getsockopt(SOL_SOCKET) to share code with sock_getsockopt() and avoid duplication and drift between the two copies. The current sock_getsockopt() takes a sock ptr as its argument, and the very first thing it does is fetch the sk ptr with 'sk = sock->sk'. bpf_getsockopt() can be called when the sk does not have a sock ptr created yet, i.e. sk->sk_socket is NULL. For example, when a passive tcp connection has just been established but has not yet been accept()-ed. Thus, it cannot use sock_getsockopt(sk->sk_socket) or it would pass a NULL ptr. This patch moves the whole sock_getsockopt() implementation into the newly added sk_getsockopt(). The new sk_getsockopt() takes a sk ptr and immediately gets the sock ptr with 'sock = sk->sk_socket'. The existing sock_getsockopt(sock) is changed to call sk_getsockopt(sock->sk). All existing callers have both sock->sk and sk->sk_socket pointers. The later patch will make bpf_getsockopt(SOL_SOCKET) call sk_getsockopt(sk) directly. bpf_getsockopt(SOL_SOCKET) does not use the optnames that require sk->sk_socket, so it will be safe. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002756.2887884-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
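A rough sketch of the resulting call structure described above (simplified; not the actual diff):

    static int sk_getsockopt(struct sock *sk, int level, int optname,
                             sockptr_t optval, sockptr_t optlen)
    {
        struct socket *sock = sk->sk_socket;    /* may be NULL for bpf callers */

        /* ... former sock_getsockopt() body, working on sk (and on sock only
         * for the optnames that need it) ...
         */
        return 0;
    }

    int sock_getsockopt(struct socket *sock, int level, int optname,
                        char __user *optval, int __user *optlen)
    {
        return sk_getsockopt(sock->sk, level, optname,
                             USER_SOCKPTR(optval), USER_SOCKPTR(optlen));
    }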
-
- 02 Sep, 2022 11 commits
-
-
Ian Rogers authored
The put lowers the reference count to 0 and frees ctx; reading it afterwards is invalid. Move the put after the uses, and determine the last use by the reference count being 1. Fixes: 39e940d4 ("selftests/xsk: Destroy BPF resources only when ctx refcount drops to 0") Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901202645.1463552-1-irogers@google.com
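A self-contained illustration of the pattern behind the fix (not the selftest code itself): read the refcounted object and record whether we hold the last reference before the put that may free it.

    #include <stdio.h>
    #include <stdlib.h>

    struct ctx {
        int refcount;
        int ifindex;
    };

    static void ctx_put(struct ctx *c)
    {
        if (--c->refcount == 0)
            free(c);        /* c must not be touched after this */
    }

    static void release(struct ctx *c)
    {
        int last = (c->refcount == 1);  /* decide before dropping the ref */
        int ifindex = c->ifindex;       /* read fields before the put */

        ctx_put(c);                     /* may free c */
        if (last)
            printf("dropped last ref, ifindex was %d\n", ifindex);
    }

    int main(void)
    {
        struct ctx *c = calloc(1, sizeof(*c));

        if (!c)
            return 1;
        c->refcount = 1;
        c->ifindex = 7;
        release(c);
        return 0;
    }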
-
Daniel Müller authored
BPF object files are, in a way, the final artifact produced as part of the ahead-of-time compilation process. That makes them somewhat special compared to "regular" object files, which are intermediate build artifacts that can typically be removed safely. As such, it can make sense to name them differently to make it easier to spot this difference at a glance. Among others, libbpf-bootstrap [0] has established the extension .bpf.o for BPF object files. It seems reasonable to follow this example and establish the same denomination for selftest build artifacts. To that end, this change adjusts the corresponding part of the build system and the test programs loading BPF object files to work with .bpf.o files. [0] https://github.com/libbpf/libbpf-bootstrap Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Müller <deso@posteo.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220901222253.1199242-1-deso@posteo.net
-
Maciej Fijalkowski authored
Introduce a new mode to xdpxceiver responsible for testing AF_XDP zero-copy support of the driver that serves the underlying physical device. When setting up the test suite, determine whether the driver has ZC support by trying to bind an XSK ZC socket to the interface. If that succeeds, interpret it as ZC support being in place and run the softirq and busy-poll tests in zero-copy mode. Note that the Rx dropped tests are skipped since the ZC path does not touch the rx_dropped stat at all. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-7-maciej.fijalkowski@intel.com
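Roughly how such a probe can look (a sketch, assuming a UMEM and its rings were already created with xsk_umem__create() and that the xsk.h API used by the selftests at the time is available; the real xdpxceiver code may differ): a failed bind with XDP_ZEROCOPY is treated as no ZC support.

    #include <stdbool.h>
    #include <linux/if_link.h>
    #include <linux/if_xdp.h>
    #include <bpf/xsk.h>

    static bool probe_zero_copy(struct xsk_umem *umem, const char *ifname,
                                __u32 queue_id)
    {
        struct xsk_socket_config cfg = {
            .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
            .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
            .xdp_flags = XDP_FLAGS_DRV_MODE,
            .bind_flags = XDP_ZEROCOPY,     /* the actual probe */
        };
        struct xsk_ring_cons rx;
        struct xsk_ring_prod tx;
        struct xsk_socket *xsk = NULL;

        if (xsk_socket__create(&xsk, ifname, queue_id, umem, &rx, &tx, &cfg))
            return false;   /* bind with XDP_ZEROCOPY was rejected */

        xsk_socket__delete(xsk);
        return true;
    }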
-
Maciej Fijalkowski authored
For single-threaded poll tests, call pthread_kill() from the main thread so that we are sure the worker thread has finished its job and it is possible to proceed with the next test types from the test suite. It was observed that on some platforms it takes a bit longer for the worker thread to exit, and in that case the next test case sees the device as busy. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-6-maciej.fijalkowski@intel.com
-
Maciej Fijalkowski authored
Currently, the architecture of xdpxceiver is designed strictly for conducting veth-based tests. A veth pair is created together with a network namespace, and one of the veth interfaces is moved to the mentioned netns. Then, separate threads for Tx and Rx are spawned which utilize the described setup. The infrastructure described in the paragraph above cannot be used for testing AF_XDP support on physical devices. That testing will be conducted on a single network interface and the same queue. Xskxceiver needs to be extended to distinguish between veth tests and physical interface tests. Since the same iface/queue id pair will be used by both Tx/Rx threads for physical device testing, the Tx thread, which happens to run after the Rx thread, is going to create the XSK socket with the shared umem flag. In order to track this setting throughout the lifetime of the spawned threads, introduce a 'shared_umem' boolean variable to struct ifobject and set it to true when xdpxceiver is run against a physical device. In that case, the UMEM size needs to be doubled, so half of it will be used by the Rx thread and the other half by the Tx thread. For two-step based test types, the value of the XSKMAP element under key 0 has to be updated as there is now another socket for the second step. Also, to avoid race conditions when destroying XSK resources, move this activity to the main thread after the spawned Rx and Tx threads have finished their jobs. This way it is possible to gracefully remove the shared umem without introducing synchronization mechanisms. To run the xsk selftests suite on a physical device, append "-i $IFACE" when invoking test_xsk.sh. For veth-based tests, simply skip it. When "-i $IFACE" is in place, under the hood test_xsk.sh will use $IFACE for both interfaces supplied to xdpxceiver, which in turn will interpret that this execution of the test suite is for a physical device. Note that currently this makes it possible only to test SKB and DRV mode (in case the underlying device has native XDP support). ZC testing support is added in a later patch. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-5-maciej.fijalkowski@intel.com
-
Maciej Fijalkowski authored
So that "enp240s0f0" or such name can be used against xskxceiver. While at it, also extend character count for netns name. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-4-maciej.fijalkowski@intel.com
-
Maciej Fijalkowski authored
In order to prepare xdpxceiver for physical device testing, let us introduce a default Rx pkt stream. The reason for doing it is that physical device testing will use a UMEM of doubled size where half of it will be used by Tx and the other half by Rx. This means that pkt addresses will differ for the Tx and Rx streams. The Rx thread will initialize the xsk_umem_info::base_addr that is added here, so that pkt_set(), when working on the Rx UMEM, will add this offset and the second half of the UMEM space will be used. Note that currently base_addr is 0 on both sides; a future commit will do the mentioned initialization. Previously, veth-based testing worked on separate UMEMs, so a single default stream was fine. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-3-maciej.fijalkowski@intel.com
-
Maciej Fijalkowski authored
Currently, xdpxceiver assumes that the underlying device supports XDP in native mode - that has been fine until now, since tests could only run on a veth pair. A future commit is going to allow running the test suite against physical devices, so let us query the device to see whether it is capable of running XDP programs in native mode. This way xdpxceiver will not try to run TEST_MODE_DRV if the device being tested does not support it. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-2-maciej.fijalkowski@intel.com
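One way such a probe can be written (a hedged sketch using libbpf's bpf_prog_load()/bpf_xdp_attach(); the actual xdpxceiver implementation may use a different mechanism): load a trivial XDP_PASS program and see whether the kernel accepts attaching it in driver mode.

    #include <stdbool.h>
    #include <unistd.h>
    #include <linux/bpf.h>
    #include <linux/if_link.h>
    #include <bpf/bpf.h>
    #include <bpf/libbpf.h>

    static bool supports_native_xdp(int ifindex)
    {
        struct bpf_insn insns[] = {
            /* r0 = XDP_PASS; exit */
            { .code = BPF_ALU64 | BPF_MOV | BPF_K,
              .dst_reg = BPF_REG_0, .imm = XDP_PASS },
            { .code = BPF_JMP | BPF_EXIT },
        };
        int prog_fd, err;

        prog_fd = bpf_prog_load(BPF_PROG_TYPE_XDP, "xdp_probe", "GPL",
                                insns, 2, NULL);
        if (prog_fd < 0)
            return false;

        err = bpf_xdp_attach(ifindex, prog_fd, XDP_FLAGS_DRV_MODE, NULL);
        if (!err)
            bpf_xdp_detach(ifindex, XDP_FLAGS_DRV_MODE, NULL);
        close(prog_fd);
        return !err;
    }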
-
Shmulik Ladkani authored
Get the tunnel flags in {ipv6}vxlan_get_tunnel_src and ensure they are aligned with tunnel params set at {ipv6}vxlan_set_tunnel_dst. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220831144010.174110-2-shmulik.ladkani@gmail.com
-
Shmulik Ladkani authored
Existing 'bpf_skb_get_tunnel_key' extracts various tunnel parameters (id, ttl, tos, local and remote) but does not expose ip_tunnel_info's tun_flags to the BPF program. It makes sense to expose tun_flags to the BPF program. Assume for example multiple GRE tunnels maintained on a single GRE interface in collect_md mode. The program expects origins to initiate over GRE; however, different origins use different GRE characteristics (e.g. some prefer to use GRE checksum, some do not; some pass a GRE key, some do not, etc..). A BPF program getting tun_flags can therefore remember the relevant flags (e.g. TUNNEL_CSUM, TUNNEL_SEQ...) for each initiating remote. In the reply path, the program can use 'bpf_skb_set_tunnel_key' in order to correctly reply to the remote, using similar characteristics, based on the stored tunnel flags. Introduce the BPF_F_TUNINFO_FLAGS flag for bpf_skb_get_tunnel_key. If specified, 'bpf_tunnel_key->tunnel_flags' is set with the tun_flags. We decided to use the existing, unused 'tunnel_ext' as the storage for the 'tunnel_flags' in order to avoid changing bpf_tunnel_key's layout. Also, the following has been considered during the design: 1. Convert the "interesting" internal TUNNEL_xxx flags back to BPF_F_yyy and place into the new 'tunnel_flags' field. This has 2 drawbacks: - The BPF_F_yyy flags are from *set_tunnel_key* enumeration space, e.g. BPF_F_ZERO_CSUM_TX. It is awkward that it is "returned" into tunnel_flags from a *get_tunnel_key* call. - Not all "interesting" TUNNEL_xxx flags can be mapped to existing BPF_F_yyy flags, and it doesn't make sense to create new BPF_F_yyy flags just for the purposes of the returned tunnel_flags. 2. Place key.tun_flags into 'tunnel_flags' but mask them, keeping only "interesting" flags. That's ok, but the drawback is that what's "interesting" for my usecase might be limiting for other usecases. Therefore I decided to expose what's in key.tun_flags *as is*, which seems most flexible. The BPF user can just choose to ignore bits he's not interested in. The TUNNEL_xxx are also UAPI, so no harm exposing them back in the get_tunnel_key call. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220831144010.174110-1-shmulik.ladkani@gmail.com
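A hedged usage sketch from the BPF program side (field and flag names as described above; assumes UAPI headers that already contain BPF_F_TUNINFO_FLAGS):

    #include <linux/bpf.h>
    #include <linux/if_tunnel.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    SEC("tc")
    int remember_tun_flags(struct __sk_buff *skb)
    {
        struct bpf_tunnel_key key = {};

        if (bpf_skb_get_tunnel_key(skb, &key, sizeof(key),
                                   BPF_F_TUNINFO_FLAGS) < 0)
            return TC_ACT_OK;

        /* tunnel_flags shares storage with the previously unused tunnel_ext */
        if (key.tunnel_flags & TUNNEL_CSUM) {
            /* e.g. record that this remote uses GRE checksums, for use
             * when building the reply with bpf_skb_set_tunnel_key()
             */
        }
        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";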
-
Shung-Hsi Yu authored
Commit a657182a ("bpf: Don't use tnum_range on array range checking for poke descriptors") has shown that using tnum_range() as an argument to tnum_in() can lead to misleading code that looks like a tight bound check when in fact the actual allowed range is much wider. Document such behavior to warn against its usage in general, and suggest some scenarios where the result can be trusted. Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/984b37f9fdf7ac36831d2137415a4a915744c1b6.1661462653.git.daniel@iogearbox.net Link: https://www.openwall.com/lists/oss-security/2022/08/26/1 Link: https://lore.kernel.org/bpf/20220831031907.16133-3-shung-hsi.yu@suse.com Link: https://lore.kernel.org/bpf/20220831031907.16133-2-shung-hsi.yu@suse.com
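For concreteness, a standalone demo of the pitfall being documented (a userspace reimplementation mirroring the kernel's tnum_range()/tnum_in() logic): the returned tnum can only track known bits, so it may admit values outside [min, max].

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct tnum { uint64_t value; uint64_t mask; };

    static struct tnum tnum_range(uint64_t min, uint64_t max)
    {
        uint64_t chi = min ^ max, delta;
        int bits = chi ? 64 - __builtin_clzll(chi) : 0;

        if (bits > 63)                          /* whole 64-bit space */
            return (struct tnum){ 0, ~0ULL };
        delta = (1ULL << bits) - 1;
        return (struct tnum){ min & ~delta, delta };
    }

    static bool tnum_in(struct tnum a, struct tnum b)
    {
        if (b.mask & ~a.mask)
            return false;
        b.value &= ~a.mask;
        return a.value == b.value;
    }

    int main(void)
    {
        struct tnum r = tnum_range(0, 2);               /* intent: allow {0, 1, 2} */
        struct tnum three = { .value = 3, .mask = 0 };  /* constant 3 */

        /* prints 1: 3 passes the check even though it is outside [0, 2] */
        printf("tnum_in(range(0,2), 3) = %d\n", tnum_in(r, three));
        return 0;
    }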
-
- 01 Sep, 2022 7 commits
-
-
Hou Tao authored
When CONFIG_SECURITY_NETWORK is disabled, there will be build warnings from resolve_btfids: WARN: resolve_btfids: unresolved symbol bpf_lsm_socket_socketpair ...... WARN: resolve_btfids: unresolved symbol bpf_lsm_inet_conn_established Fix it by wrapping these BTF ID definitions in CONFIG_SECURITY_NETWORK. Fixes: 69fd337a ("bpf: per-cgroup lsm flavor") Fixes: 9113d7e4 ("bpf: expose bpf_{g,s}etsockopt to lsm cgroup") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220901065126.3856297-1-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Jiapeng Chong authored
The assignment in the else and else if branches is the same, so the else if here is redundant; remove it and add a comment to make the code more readable. ./kernel/bpf/cgroup_iter.c:81:6-8: WARNING: possible condition with no effect (if == else). Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2016 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20220831021618.86770-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Martin KaFai Lau authored
Hou Tao says: ==================== From: Hou Tao <houtao1@huawei.com> Hi, The patchset aims to make the updates of per-cpu prog->active and per-cpu bpf_task_storage_busy preemption-safe. The problem is that on some architectures (e.g. arm64), __this_cpu_{inc|dec|inc_return} are neither preemption-safe nor IRQ-safe, so under a fully preemptible kernel the concurrent updates on these per-cpu variables may be interleaved and the final values of these variables may not be zero. Patch 1 & 2 use the preemption-safe per-cpu helpers to manipulate prog->active and bpf_task_storage_busy. Patch 3 & 4 add a test case in map_tests to show that concurrent updates on the per-cpu bpf_task_storage_busy using __this_cpu_{inc|dec} are not atomic. Comments are always welcome. Regards, Tao Change Log: v2: * Patch 1: update commit message to indicate the problem is only possible for a fully preemptible kernel * Patch 2: a new patch which fixes the problem for prog->active * Patch 3 & 4: move it to test_maps and make it depend on CONFIG_PREEMPT v1: https://lore.kernel.org/bpf/20220829142752.330094-1-houtao@huaweicloud.com/ ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Hou Tao authored
Under a fully preemptible kernel, task local storage lookup operations on the same CPU may update per-cpu bpf_task_storage_busy concurrently. If the update of bpf_task_storage_busy is not preemption safe, the final value of bpf_task_storage_busy may become non-zero forever and bpf_task_storage_trylock() will always fail. So add a test case to ensure the update of bpf_task_storage_busy is preemption safe. The test case is skipped when CONFIG_PREEMPT is disabled, and it can only reproduce the problem probabilistically. By increasing TASK_STORAGE_MAP_NR_LOOP and running it under an ARM64 VM with 4 cpus, it takes about four rounds to reproduce: > test_maps is modified to only run test_task_storage_map_stress_lookup() $ export TASK_STORAGE_MAP_NR_THREAD=256 $ export TASK_STORAGE_MAP_NR_LOOP=81920 $ export TASK_STORAGE_MAP_PIN_CPU=1 $ time ./test_maps test_task_storage_map_stress_lookup(135):FAIL:bad bpf_task_storage_busy got -2 real 0m24.743s user 0m6.772s sys 0m17.966s Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20220901061938.3789460-5-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Hou Tao authored
sys_pidfd_open() is defined in both test_bprm_opts.c and test_local_storage.c, so move it to a common header file. It will be used in map_tests as well. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20220901061938.3789460-4-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
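The kind of wrapper being shared (a sketch; the exact header location in the tree may differ). pidfd_open has no glibc wrapper, so the tests invoke the raw syscall; this assumes a libc/kernel that defines __NR_pidfd_open.

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    static inline int sys_pidfd_open(pid_t pid, unsigned int flags)
    {
        return syscall(__NR_pidfd_open, pid, flags);
    }

    int main(void)
    {
        int pidfd = sys_pidfd_open(getpid(), 0);

        if (pidfd < 0) {
            perror("pidfd_open");
            return 1;
        }
        printf("pidfd for self: %d\n", pidfd);
        close(pidfd);
        return 0;
    }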
-
Hou Tao authored
Both __this_cpu_inc_return() and __this_cpu_dec() are not preemption safe, and migrate_disable() no longer disables preemption, so the update of prog->active is not atomic and, in theory, the recursion prevention may not work under a fully preemptible kernel. Fix this by using the preemption-safe and IRQ-safe variants. Fixes: ca06f55b ("bpf: Add per-program recursion prevention mechanism") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20220901061938.3789460-3-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Hou Tao authored
Now migrate_disable() does not disable preemption, and on some architectures (e.g. arm64) __this_cpu_{inc|dec|inc_return} are neither preemption-safe nor IRQ-safe, so on a fully preemptible kernel concurrent lookups or updates on the same task local storage on the same CPU may leave bpf_task_storage_busy imbalanced, and bpf_task_storage_trylock() on that cpu will always fail. Fix it by using this_cpu_{inc|dec|inc_return} when manipulating bpf_task_storage_busy. Fixes: bc235cdb ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20220901061938.3789460-2-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
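A kernel-side sketch of the pattern common to this fix and the prog->active one above (illustrative, not the exact diff): the double-underscore per-cpu ops assume preemption is already disabled, while the plain this_cpu_*() variants are safe on their own.

    static bool bpf_task_storage_trylock_sketch(void)
    {
        migrate_disable();
        /* was: __this_cpu_inc_return(), not preemption-safe on e.g. arm64 */
        if (this_cpu_inc_return(bpf_task_storage_busy) != 1) {
            /* was: __this_cpu_dec() */
            this_cpu_dec(bpf_task_storage_busy);
            migrate_enable();
            return false;
        }
        return true;
    }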
-
- 31 Aug, 2022 7 commits
-
-
Martin KaFai Lau authored
Hou Tao says: ==================== From: Hou Tao <houtao1@huawei.com> Hi, The patchset aims to fix the issues found while investigating the syzkaller problem reported in [0]. It seems that the concurrent updates to the same hash-table bucket may fail as shown in patch 1. Patch 1 uses preempt_disable() to fix the problem for the htab_use_raw_lock() case. For the !htab_use_raw_lock() case, the problem is left to the "BPF specific memory allocator" patchset [1] in which !htab_use_raw_lock() will be removed. Patch 2 fixes the out-of-bound memory read problem reported in [0]. The problem has the same root cause as patch 1 and it is fixed by handling -EBUSY from htab_lock_bucket() correctly. Patch 3 adds two cases for hash-table update: one for the reentrancy of bpf_map_update_elem(), and another one for concurrent updates of the same hash-table bucket. Comments are always welcome. Regards, Tao [0]: https://lore.kernel.org/bpf/CACkBjsbuxaR6cv0kXJoVnBfL9ZJXjjoUcMpw_Ogc313jSrg14A@mail.gmail.com/ [1]: https://lore.kernel.org/bpf/20220819214232.18784-1-alexei.starovoitov@gmail.com/ Change Log: v4: * rebased on bpf-next * add htab_update to DENYLIST.s390x v3: https://lore.kernel.org/bpf/20220829023709.1958204-1-houtao@huaweicloud.com/ * patch 1: update commit message and add Fixes tag * patch 2: add Fixes tag * patch 3: elaborate the description of test cases v2: https://lore.kernel.org/bpf/bd60ef93-1c6a-2db2-557d-b09b92ad22bd@huaweicloud.com/ * Note the fix is for CONFIG_PREEMPT case in commit message and add Reviewed-by tag for patch 1 * Drop patch "bpf: Allow normally concurrent map updates for !htab_use_raw_lock() case" v1: https://lore.kernel.org/bpf/20220821033223.2598791-1-houtao@huaweicloud.com/ ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Hou Tao authored
One test demonstrates that a reentrant hash map update on the same bucket should fail, and another one shows that concurrent updates of the same hash map bucket should succeed and not fail due to the reentrancy check for the bucket lock. There is no trampoline support on s390x, so move htab_update to the denylist. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220831042629.130006-4-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Hou Tao authored
In __htab_map_lookup_and_delete_batch(), if htab_lock_bucket() returns -EBUSY, it will go to the next bucket. Going to the next bucket may not only silently skip the elements in the current bucket, but also incur an out-of-bound memory access or expose kernel memory to userspace if the current bucket_cnt is greater than bucket_size or zero. Fix it by stopping the batch operation and returning -EBUSY when htab_lock_bucket() fails; the application can retry or skip the busy batch as needed. Fixes: 20b6cc34 ("bpf: Avoid hashtab deadlock with map_locked") Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220831042629.130006-3-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Hou Tao authored
Per-cpu htab->map_locked is used to prohibit concurrent accesses from both NMI and non-NMI contexts. But since commit 74d862b6 ("sched: Make migrate_disable/enable() independent of RT"), migrate_disable() is also preemptible in the CONFIG_PREEMPT case, so now map_locked also unexpectedly disallows concurrent updates from normal contexts (e.g. userspace processes), as in the following interleaving: process A runs htab_map_update_elem() -> htab_lock_bucket() -> migrate_disable(), its __this_cpu_inc_return() returns 1, and it is then preempted by process B; process B runs htab_map_update_elem() on the same bucket -> htab_lock_bucket() -> migrate_disable(), its __this_cpu_inc_return() returns 2, so the lock fails and -EBUSY is returned. A fix that seems feasible is using in_nmi() in htab_lock_bucket() and only checking the value of map_locked for the nmi context. But it would re-introduce a dead-lock on the bucket lock if htab_lock_bucket() is re-entered through a non-tracing program (e.g. a fentry program). One cannot use preempt_disable() to fix this issue either, as htab_use_raw_lock being false causes the bucket lock to be a spin lock which can sleep and does not work with preempt_disable(). Therefore, use migrate_disable() when using the spinlock instead of preempt_disable(), and defer fixing concurrent updates to when the kernel has its own BPF memory allocator. Fixes: 74d862b6 ("sched: Make migrate_disable/enable() independent of RT") Reviewed-by: Hao Luo <haoluo@google.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220831042629.130006-2-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Martin KaFai Lau authored
This patch adds a test to ensure bpf_setsockopt(TCP_CONGESTION, "not_exist") will not trigger the kernel module autoload. Before the fix: [ 40.535829] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 [...] [ 40.552134] tcp_ca_find_autoload.constprop.0+0xcb/0x200 [ 40.552689] tcp_set_congestion_control+0x99/0x7b0 [ 40.553203] do_tcp_setsockopt+0x3ed/0x2240 [...] [ 40.556041] __bpf_setsockopt+0x124/0x640 Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220830231953.792412-1-martin.lau@linux.dev
-
Martin KaFai Lau authored
When a bpf prog changes tcp-cc by calling bpf_setsockopt(TCP_CONGESTION), it should not try to load a module, which may be a blocking operation. This detail was correct in v1 [0] but was missed by mistake in the later revision in commit cb388e7e ("bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()"). This patch fixes it by checking has_current_bpf_ctx(). [0] https://lore.kernel.org/bpf/20220727060921.2373314-1-kafai@fb.com/ Fixes: cb388e7e ("bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()") Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220830231946.791504-1-martin.lau@linux.dev
-
James Hilliard authored
The bpf_tail_call_static function is currently not defined unless clang >= 8 is used. To support bpf_tail_call_static on GCC, we can enable it when __clang__ is not defined. We need to use GCC assembly syntax when the compiler does not define __clang__, as LLVM inline assembly is not fully compatible with GCC. Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220829210546.755377-1-james.hilliard1@gmail.com
-
- 30 Aug, 2022 1 commit
-
-
Hao Luo authored
Support dumping info of a cgroup_iter link. This includes showing the cgroup's id and the order for walking the cgroup hierarchy. Example output is as follows: > bpftool link show 1: iter prog 2 target_name bpf_map 2: iter prog 3 target_name bpf_prog 3: iter prog 12 target_name cgroup cgroup_id 72 order self_only > bpftool -p link show [{ "id": 1, "type": "iter", "prog_id": 2, "target_name": "bpf_map" },{ "id": 2, "type": "iter", "prog_id": 3, "target_name": "bpf_prog" },{ "id": 3, "type": "iter", "prog_id": 12, "target_name": "cgroup", "cgroup_id": 72, "order": "self_only" } ] Signed-off-by: Hao Luo <haoluo@google.com> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220829231828.1016835-1-haoluo@google.com Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev>
-
- 29 Aug, 2022 3 commits
-
-
James Hilliard authored
There is a potential for us to hit a type conflict when including netinet/tcp.h and sys/socket.h, we can replace both of these includes with linux/tcp.h and bpf_tcp_helpers.h to avoid this conflict. Fixes errors like the below when compiling with gcc BPF backend: In file included from /usr/include/netinet/tcp.h:91, from progs/connect4_prog.c:11: /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:34:23: error: conflicting types for 'int8_t'; have 'char' 34 | typedef __INT8_TYPE__ int8_t; | ^~~~~~ In file included from /usr/include/x86_64-linux-gnu/sys/types.h:155, from /usr/include/x86_64-linux-gnu/bits/socket.h:29, from /usr/include/x86_64-linux-gnu/sys/socket.h:33, from progs/connect4_prog.c:10: /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:24:18: note: previous declaration of 'int8_t' with type 'int8_t' {aka 'signed char'} 24 | typedef __int8_t int8_t; | ^~~~~~ /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:43:24: error: conflicting types for 'int64_t'; have 'long int' 43 | typedef __INT64_TYPE__ int64_t; | ^~~~~~~ /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:27:19: note: previous declaration of 'int64_t' with type 'int64_t' {aka 'long long int'} 27 | typedef __int64_t int64_t; | ^~~~~~~ Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220829154710.3870139-1-james.hilliard1@gmail.com
-
James Hilliard authored
There is a potential for us to hit a type conflict when including netinet/tcp.h with sys/socket.h, we can remove these as they are not actually needed. Fixes errors like the below when compiling with gcc BPF backend: In file included from /usr/include/netinet/tcp.h:91, from progs/bind4_prog.c:10: /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:34:23: error: conflicting types for 'int8_t'; have 'char' 34 | typedef __INT8_TYPE__ int8_t; | ^~~~~~ In file included from /usr/include/x86_64-linux-gnu/sys/types.h:155, from /usr/include/x86_64-linux-gnu/bits/socket.h:29, from /usr/include/x86_64-linux-gnu/sys/socket.h:33, from progs/bind4_prog.c:9: /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:24:18: note: previous declaration of 'int8_t' with type 'int8_t' {aka 'signed char'} 24 | typedef __int8_t int8_t; | ^~~~~~ /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:43:24: error: conflicting types for 'int64_t'; have 'long int' 43 | typedef __INT64_TYPE__ int64_t; | ^~~~~~~ /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:27:19: note: previous declaration of 'int64_t' with type 'int64_t' {aka 'long long int'} 27 | typedef __int64_t int64_t; | ^~~~~~~ make: *** [Makefile:537: /home/buildroot/bpf-next/tools/testing/selftests/bpf/bpf_gcc/bind4_prog.o] Error 1 Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220826052925.980431-1-james.hilliard1@gmail.com
-
Tiezhu Yang authored
MAX_TAIL_CALL_CNT is 33, so min(MAX_TAIL_CALL_CNT, 0xffff) is always MAX_TAIL_CALL_CNT, it is better to use MAX_TAIL_CALL_CNT directly. At the same time, add BUILD_BUG_ON(MAX_TAIL_CALL_CNT > 0xffff) with a comment on why the assertion is there. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Suggested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/1661742309-2320-1-git-send-email-yangtiezhu@loongson.cn
-
- 27 Aug, 2022 2 commits
-
-
Quentin Monnet authored
Address a few typos in the documentation for the BPF helper functions. They were reported by Jakub [0], who ran spell checkers on the generated man page [1]. [0] https://lore.kernel.org/linux-man/d22dcd47-023c-8f52-d369-7b5308e6c842@gmail.com/T/#mb02e7d4b7fb61d98fa914c77b581184e9a9537af [1] https://lore.kernel.org/linux-man/eb6a1e41-c48e-ac45-5154-ac57a2c76108@gmail.com/T/#m4a8d1b003616928013ffcd1450437309ab652f9f v3: Do not copy unrelated (and breaking) elements to tools/ header v2: Turn a ',' into a ';' Reported-by: Jakub Wilk <jwilk@jwilk.net> Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220825220806.107143-1-quentin@isovalent.com
-
James Hilliard authored
Due to bpf_map_lookup_elem being declared static we need to also declare subprog_noise as static. Fixes the following error: progs/tailcall_bpf2bpf4.c:26:9: error: 'bpf_map_lookup_elem' is static but used in inline function 'subprog_noise' which is not static [-Werror] 26 | bpf_map_lookup_elem(&nop_table, &key); | ^~~~~~~~~~~~~~~~~~~ Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20220826035141.737919-1-james.hilliard1@gmail.com
-
- 26 Aug, 2022 3 commits
-
-
James Hilliard authored
The sys/socket.h header isn't required to build test_tc_dtime and may cause a type conflict. Fixes the following error: In file included from /usr/include/x86_64-linux-gnu/sys/types.h:155, from /usr/include/x86_64-linux-gnu/bits/socket.h:29, from /usr/include/x86_64-linux-gnu/sys/socket.h:33, from progs/test_tc_dtime.c:18: /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:24:18: error: conflicting types for 'int8_t'; have '__int8_t' {aka 'signed char'} 24 | typedef __int8_t int8_t; | ^~~~~~ In file included from progs/test_tc_dtime.c:5: /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:34:23: note: previous declaration of 'int8_t' with type 'int8_t' {aka 'char'} 34 | typedef __INT8_TYPE__ int8_t; | ^~~~~~ /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:27:19: error: conflicting types for 'int64_t'; have '__int64_t' {aka 'long long int'} 27 | typedef __int64_t int64_t; | ^~~~~~~ /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:43:24: note: previous declaration of 'int64_t' with type 'int64_t' {aka 'long int'} 43 | typedef __INT64_TYPE__ int64_t; | ^~~~~~~ make: *** [Makefile:537: /home/buildroot/bpf-next/tools/testing/selftests/bpf/bpf_gcc/test_tc_dtime.o] Error 1 Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Link: https://lore.kernel.org/r/20220826050703.869571-1-james.hilliard1@gmail.com Signed-off-by: Martin KaFai Lau <kafai@fb.com>
-
Benjamin Tissoires authored
This allows better control over maps from the kernel when preloading eBPF programs. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20220824134055.1328882-8-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Benjamin Tissoires authored
Add BPF_MAP_GET_FD_BY_ID and BPF_MAP_DELETE_PROG. Only BPF_MAP_GET_FD_BY_ID needs to be amended to be able to access the bpf pointer either from userspace or the kernel. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20220824134055.1328882-7-benjamin.tissoires@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 25 Aug, 2022 4 commits
-
-
Hao Luo authored
bpf_cgroup_iter_order is globally visible but the entries do not have a CGROUP prefix. As requested by Andrii, put a CGROUP in the names in bpf_cgroup_iter_order. This patch fixes two previous commits: one introduced the API and the other uses the API in bpf selftest (that is, the selftest cgroup_hierarchical_stats). I tested this patch via the following command: test_progs -t cgroup,iter,btf_dump Fixes: d4ccaf58 ("bpf: Introduce cgroup iter") Fixes: 88886309 ("selftests/bpf: add a selftest for cgroup hierarchical stats collection") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220825223936.1865810-1-haoluo@google.com Signed-off-by: Martin KaFai Lau <kafai@fb.com>
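For reference, the renamed ordering constants as described above (a sketch of the UAPI enum; consult include/uapi/linux/bpf.h for the authoritative definition):

    enum bpf_cgroup_iter_order {
        BPF_CGROUP_ITER_ORDER_UNSPEC = 0,
        BPF_CGROUP_ITER_SELF_ONLY,          /* process only a single object */
        BPF_CGROUP_ITER_DESCENDANTS_PRE,    /* walk descendants in pre-order */
        BPF_CGROUP_ITER_DESCENDANTS_POST,   /* walk descendants in post-order */
        BPF_CGROUP_ITER_ANCESTORS_UP,       /* walk ancestors upward */
    };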
-
Eyal Birger authored
The helper value is ABI, as defined by enum bpf_func_id. As bpf_helper_defs.h is used for the userspace part, it must be consistent with this enum. Before this change the comment order was used by the bpf_doc script in order to set the helper values defined in the helpers file. When adding new helpers it is very puzzling when the userspace application breaks in weird places if the comment is inserted instead of appended, because the generated helper ABI is incorrect and shifted. This commit sets the helper value to the enum value. In addition, it is currently the practice to have the comments appended and kept in the same order as the enum. As such, add an assertion validating that the comment order is consistent with the enum value. In case a different comment ordering is desired, this assertion can be lifted. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20220824181043.1601429-1-eyal.birger@gmail.com
-
Lam Thai authored
When `data` points to a boolean value, casting it to `int *` is problematic and could lead to a wrong value being passed to `jsonw_bool`. Change the cast to `bool *` instead. Fixes: b12d6ec0 ("bpf: btf: add btf print functionality") Signed-off-by: Lam Thai <lamthai@arista.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220824225859.9038-1-lamthai@arista.com
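A standalone illustration of why the cast matters (made-up data, not bpftool code): a 1-byte bool member read through an int * also picks up the three neighbouring bytes, so a false value can be reported as true.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* a false (0) bool followed by unrelated struct bytes */
        unsigned char obj[4] = { 0x00, 0xde, 0xad, 0xbe };
        const void *data = obj;             /* what the dumper sees */
        int as_int;

        memcpy(&as_int, data, sizeof(as_int));  /* effect of the old *(int *)data read */
        printf("as int : %s\n", as_int ? "true" : "false");              /* wrong: true */
        printf("as bool: %s\n", *(const bool *)data ? "true" : "false"); /* false */
        return 0;
    }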
-
Alexei Starovoitov authored
Hao Luo says: ==================== This patch series allows for using bpf to collect hierarchical cgroup stats efficiently by integrating with the rstat framework. The rstat framework provides an efficient way to collect cgroup stats percpu and propagate them through the cgroup hierarchy. The stats are exposed to userspace in textual form by reading files in bpffs, similar to cgroupfs stats by using a cgroup_iter program. cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes: - walking a cgroup's descendants in pre-order. - walking a cgroup's descendants in post-order. - walking a cgroup's ancestors. - process only a single object. When attaching cgroup_iter, one needs to set a cgroup to the iter_link created from attaching. This cgroup can be passed either as a file descriptor or a cgroup id. That cgroup serves as the starting point of the walk. One can also terminate the walk early by returning 1 from the iter program. Note that because walking cgroup hierarchy holds cgroup_mutex, the iter program is called with cgroup_mutex held. ** Background on rstat for stats collection ** (I am using a subscriber analogy that is not commonly used) The rstat framework maintains a tree of cgroups that have updates and which cpus have updates. A subscriber to the rstat framework maintains their own stats. The framework is used to tell the subscriber when and what to flush, for the most efficient stats propagation. The workflow is as follows: - When a subscriber updates a cgroup on a cpu, it informs the rstat framework by calling cgroup_rstat_updated(cgrp, cpu). - When a subscriber wants to read some stats for a cgroup, it asks the rstat framework to initiate a stats flush (propagation) by calling cgroup_rstat_flush(cgrp). - When the rstat framework initiates a flush, it makes callbacks to subscribers to aggregate stats on cpus that have updates, and propagate updates to their parent. Currently, the main subscribers to the rstat framework are cgroup subsystems (e.g. memory, block). This patch series allow bpf programs to become subscribers as well. Patches in this series are organized as follows: * Patches 1-2 introduce cgroup_iter prog, and a selftest. * Patches 3-5 allow bpf programs to integrate with rstat by adding the necessary hook points and kfunc. A comprehensive selftest that demonstrates the entire workflow for using bpf and rstat to efficiently collect and output cgroup stats is added. --- Changelog: v8 -> v9: - Make UNSPEC (an invalid option) as the default order for cgroup_iter. - Use enum for specifying cgroup_iter order, instead of u32. - Add BPF_ITER_RESHCED to cgroup_iter. - Add cgroup_hierarchical_stats to s390x denylist. v7 -> v8: - Removed the confusing BPF_ITER_DEFAULT (Andrii) - s/SELF/SELF_ONLY/g - Fixed typo (e.g. outputing) (Andrii) - Use "descendants_pre", "descendants_post" etc. instead of "pre", "post" (Andrii) v6 -> v7: - Updated commit/comments in cgroup_iter for read() behavior (Yonghong) - Extracted BPF_ITER_SELF and other options out of cgroup_iter, so that they can be used in other iters. Also renamed them. (Andrii) - Supports both cgroup_fd and cgroup_id when specifying target cgroup. (Andrii) - Avoided using macro for formatting expected output in cgroup_iter selftest. (Andrii) - Applied 'static' on all vars and functions in cgroup_iter selftest. (Andrii) - Fixed broken buf reading in cgroup_iter selftest. (Andrii) - Switched to use bpf_link__destroy() unconditionally. (Andrii) - Removed 'volatile' for non-const global vars in selftests. 
(Andrii) - Started using bpf_core_enum_value() to get memory_cgrp_id. (Andrii) v5 -> v6: - Rebased on bpf-next - Tidy up cgroup_hierarchical_stats test (Andrii) * 'static' and 'inline' * avoid using libbpf_get_error() * string literals of cgroup paths. - Rename patch 8/8 to 'selftests/bpf' (Yonghong) - Fix cgroup_iter comments (e.g. PAGE_SIZE and uapi) (Yonghong) - Make sure further read() returns OK after previous read() finished properly (Yonghong) - Release cgroup_mutex before the last call of show() (Kumar) v4 -> v5: - Rebased on top of new kfunc flags infrastructure, updated patch 1 and patch 6 accordingly. - Added docs for sleepable kfuncs. v3 -> v4: - cgroup_iter: * reorder fields in bpf_link_info to avoid break uapi (Yonghong) * comment the behavior when cgroup_fd=0 (Yonghong) * comment on the limit of number of cgroups supported by cgroup_iter. (Yonghong) - cgroup_hierarchical_stats selftest: * Do not return -1 if stats are not found (causes overflow in userspace). * Check if child process failed to join cgroup. * Make buf and path arrays in get_cgroup_vmscan_delay() static. * Increase the test map sizes to accomodate cgroups that are not created by the test. v2 -> v3: - cgroup_iter: * Added conditional compilation of cgroup_iter.c in kernel/bpf/Makefile (kernel test) and dropped the !CONFIG_CGROUP patch. * Added validation of traversal_order when attaching (Yonghong). * Fixed previous wording "two modes" to "three modes" (Yonghong). * Fixed the btf_dump selftest broken by this patch (Yonghong). * Fixed ctx_arg_info[0] to use "PTR_TO_BTF_ID_OR_NULL" instead of "PTR_TO_BTF_ID", because the "cgroup" pointer passed to iter prog can be null. - Use __diag_push to eliminate __weak noinline warning in bpf_rstat_flush(). - cgroup_hierarchical_stats selftest: * Added write_cgroup_file_parent() helper. * Added error handling for failed map updates. * Added null check for cgroup in vmscan_flush. * Fixed the signature of vmscan_[start/end]. * Correctly return error code when attaching trace programs fail. * Make sure all links are destroyed correctly and not leaking in cgroup_hierarchical_stats selftest. * Use memory.reclaim instead of memory.high as a more reliable way to invoke reclaim. * Eliminated sleeps, the test now runs faster. v1 -> v2: - Redesign of cgroup_iter from v1, based on Alexei's idea [1]: * supports walking cgroup subtree. * supports walking ancestors of a cgroup. (Andrii) * supports terminating the walk early. * uses fd instead of cgroup_id as parameter for iter_link. Using fd is a convention in bpf. * gets cgroup's ref at attach time and deref at detach. * brought back cgroup1 support for cgroup_iter. - Squashed the patches adding the rstat flush hook points and kfuncs (Tejun). - Added a comment explaining why bpf_rstat_flush() needs to be weak (Tejun). - Updated the final selftest with the new cgroup_iter design. - Changed CHECKs in the selftest with ASSERTs (Yonghong, Andrii). - Removed empty line at the end of the selftest (Yonghong). - Renamed test files to cgroup_hierarchical_stats.c. - Reordered CGROUP_PATH params order to match struct declaration in the selftest (Michal). - Removed memory_subsys_enabled() and made sure memcg controller enablement checks make sense and are documented (Michal). RFC v2 -> v1: - Instead of introducing a new program type for rstat flushing, add an empty hook point, bpf_rstat_flush(), and use fentry bpf programs to attach to it and flush bpf stats. - Instead of using helpers, use kfuncs for rstat functions. 
- These changes simplify the patchset greatly, with minimal changes to uapi. RFC v1 -> RFC v2: - Instead of rstat flush programs attach to subsystems, they now attach to rstat (global flushers, not per-subsystem), based on discussions with Tejun. The first patch is entirely rewritten. - Pass cgroup pointers to rstat flushers instead of cgroup ids. This is much more flexibility and less likely to need a uapi update later. - rstat helpers are now only defined if CGROUP_CONFIG. - Most of the code is now only defined if CGROUP_CONFIG and CONFIG_BPF_SYSCALL. - Move rstat helper protos from bpf_base_func_proto() to tracing_prog_func_proto(). - rstat helpers argument (cgroup pointer) is now ARG_PTR_TO_BTF_ID, not ARG_ANYTHING. - Rewrote the selftest to use the cgroup helpers. - Dropped bpf_map_lookup_percpu_elem (already added by Feng). - Dropped patch to support cgroup v1 for cgroup_iter. - Dropped patch to define some cgroup_put() when !CONFIG_CGROUP. The code that calls it is no longer compiled when !CONFIG_CGROUP. cgroup_iter was originally introduced in a different patch series[2]. Hao and I agreed that it fits better as part of this series. RFC v1 of this patch series had the following changes from [2]: - Getting the cgroup's reference at the time at attaching, instead of at the time when iterating. (Yonghong) - Remove .init_seq_private and .fini_seq_private callbacks for cgroup_iter. They are not needed now. (Yonghong) [1] https://lore.kernel.org/bpf/20220520221919.jnqgv52k4ajlgzcl@MBP-98dd607d3435.dhcp.thefacebook.com/ [2] https://lore.kernel.org/lkml/20220225234339.2386398-9-haoluo@google.com/ Hao Luo (2): bpf: Introduce cgroup iter selftests/bpf: Test cgroup_iter. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-