Commits · 07018b57066e0009c0b9945723f30e532d347ed3 · Kirill Smelkov / linux

11 Jul, 2023 5 commits

samples/bpf: syscall_tp: Aarch64 no open syscall · 07018b57

Rong Tao authored Jul 11, 2023

__NR_open never existed on AArch64.
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/tencent_C6AD4AD72BEFE813228FC188905F96C6A506@qq.com

libbpf: Remove HASHMAP_INIT static initialization helper · a3e7e6b1

John Sanpe authored Jul 11, 2023

Remove the wrong HASHMAP_INIT. It's not used anywhere in libbpf.
Signed-off-by: John Sanpe <sanpeqf@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230711070712.2064144-1-sanpeqf@gmail.com

libbpf: Fix realloc API handling in zero-sized edge cases · 8a0260db

Andrii Nakryiko authored Jul 10, 2023

realloc() and reallocarray() can either return NULL or a special
non-NULL pointer, if their size argument is zero. This requires a bit
more care to handle NULL-as-valid-result situation differently from
NULL-as-error case. This has caused real issues before ([0]), and just
recently bit again in production when performing bpf_program__attach_usdt().

This patch fixes 4 places that do or potentially could suffer from this
mishandling of NULL, including the reported USDT-related one.

There are many other places where realloc()/reallocarray() is used and
NULL is always treated as an error value, but all those have guarantees
that their size is always non-zero, so those spots don't need any extra
handling.

  [0] d08ab82f ("libbpf: Fix double-free when linker processes empty sections")

Fixes: 999783c8 ("libbpf: Wire up spec management and other arch-independent USDT logic")
Fixes: b63b3c49 ("libbpf: Add bpf_program__set_insns function")
Fixes: 697f104d ("libbpf: Support custom SEC() handlers")
Fixes: b1268826 ("libbpf: Change the order of data and text relocations.")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230711024150.1566433-1-andrii@kernel.org
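
The NULL-as-valid-result versus NULL-as-error distinction described in this commit can be sketched in plain C. This is a minimal illustration, not the libbpf code: `resize_ints()` is a hypothetical helper, and the sketch assumes an allocator where a zero-sized `realloc()` may return NULL after releasing the block (as glibc does).

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not a libbpf function): resize *arr so that it can
 * hold new_cnt ints. Returns 0 on success and -ENOMEM on failure.
 */
static int resize_ints(int **arr, size_t new_cnt)
{
	int *tmp;

	/* Reject element counts whose byte size would overflow size_t. */
	if (new_cnt > SIZE_MAX / sizeof(**arr))
		return -ENOMEM;

	tmp = realloc(*arr, new_cnt * sizeof(**arr));

	/*
	 * NULL means failure only when a non-zero size was requested.
	 * For a zero size, NULL can be a perfectly valid result.
	 */
	if (!tmp && new_cnt)
		return -ENOMEM;

	*arr = tmp;
	return 0;
}
```

Without the `&& new_cnt` part of the check, shrinking an array to zero elements would be reported as an out-of-memory error even though nothing went wrong.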

bpf,docs: Create new standardization subdirectory · 4d496be9

David Vernet authored Jul 10, 2023

The BPF standardization effort is actively underway with the IETF. As
described in the BPF Working Group (WG) charter in [0], there are a
number of proposed documents, some informational and some proposed
standards, that will be drafted as part of the standardization effort.

[0]: https://datatracker.ietf.org/wg/bpf/about/

Though the specific documents that will formally be standardized will
exist as Internet Drafts (I-D) and WG documents in the BPF WG
datatracker page, the source of truth from where those documents will be
generated will reside in the kernel documentation tree (originating in
the bpf-next tree).

Because these documents will be used to generate the I-D and WG
documents which will be standardized with the IETF, they are a bit
special as far as kernel-tree documentation goes:

- They will be dual licensed with LGPL-2.1 OR BSD-2-Clause
- IETF I-D and WG documents (the documents which will actually be
  standardized) will be auto-generated from these documents.

In order to keep things clearly organized in the BPF documentation tree,
and to make it abundantly clear where standards-related documentation
needs to go, we should move standards-relevant documents into a separate
standardization/ subdirectory.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230710183027.15132-1-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bpftool: Fix skeletons compilation for older kernels' · 19f4b532

Andrii Nakryiko authored Jul 10, 2023

Quentin Monnet says:

====================
At runtime, bpftool may run its own BPF programs to get the pids of
processes referencing BPF programs, or to profile programs. The skeletons
for these programs rely on a vmlinux.h header and may fail to compile when
building bpftool on hosts running older kernels, where some structs or
enums are not defined. In this set, we address this issue by using local
definitions for struct perf_event, struct bpf_perf_link,
BPF_LINK_TYPE_PERF_EVENT (pids.bpf.c) and struct bpf_perf_event_value
(profiler.bpf.c).

This set contains patches 1 to 3 from Alexander Lobakin's series, "bpf:
random unpopular userspace fixes (32 bit et al)" (v2) [0], from April 2022.
An additional patch defines a local version of BPF_LINK_TYPE_PERF_EVENT in
bpftool's pids.bpf.c.

[0] https://lore.kernel.org/bpf/20220421003152.339542-1-alobakin@pm.me/

v2: Fixed description (CO-RE for container_of()) in patch 2.

Cc: Alexander Lobakin <aleksander.lobakin@intel.com>
Cc: Michal Suchánek <msuchanek@suse.de>

Alexander Lobakin (3):
  bpftool: use a local copy of perf_event to fix accessing ::bpf_cookie
  bpftool: define a local bpf_perf_link to fix accessing its fields
  bpftool: use a local bpf_perf_event_value to fix accessing its fields
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

10 Jul, 2023 4 commits

bpftool: Use a local bpf_perf_event_value to fix accessing its fields · 658ac068

Alexander Lobakin authored Jul 07, 2023

Fix the following error when building bpftool:

  CLANG   profiler.bpf.o
  CLANG   pid_iter.bpf.o
skeleton/profiler.bpf.c:18:21: error: invalid application of 'sizeof' to an incomplete type 'struct bpf_perf_event_value'
        __uint(value_size, sizeof(struct bpf_perf_event_value));
                           ^     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tools/bpf/bpftool/bootstrap/libbpf/include/bpf/bpf_helpers.h:13:39: note: expanded from macro '__uint'
tools/bpf/bpftool/bootstrap/libbpf/include/bpf/bpf_helper_defs.h:7:8: note: forward declaration of 'struct bpf_perf_event_value'
struct bpf_perf_event_value;
       ^

struct bpf_perf_event_value is used in the kernel only when
CONFIG_BPF_EVENTS is enabled, so it has no BTF entry when that option
is disabled. Define struct bpf_perf_event_value___local with the
`preserve_access_index` attribute inside the profiler BPF prog to
allow compiling on any config. It is a full mirror of the UAPI
structure, so it is compatible both with and without CO-RE.
bpf_perf_event_read_value() requires a pointer of the original type,
so a cast is needed.

Fixes: 47c09d6a ("bpftool: Introduce "prog profile" command")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230707095425.168126-5-quentin@isovalent.com
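
A local mirror of the UAPI structure, as described in this commit, might look roughly like the sketch below. The three `__u64` fields follow the UAPI layout of struct bpf_perf_event_value; the `CORE_RELOCATABLE` macro and the `as_uapi_value()` helper are hypothetical names added here so that the sketch also compiles with a non-BPF compiler, where the attribute is unavailable.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* Hypothetical wrapper: expands to the attribute only where it is supported. */
#if __has_attribute(preserve_access_index)
#define CORE_RELOCATABLE __attribute__((preserve_access_index))
#else
#define CORE_RELOCATABLE
#endif

/* Only a forward declaration of the original type is assumed to be visible. */
struct bpf_perf_event_value;

/* Full local mirror of the UAPI structure, so sizeof() works on any config. */
struct bpf_perf_event_value___local {
	uint64_t counter;
	uint64_t enabled;
	uint64_t running;
} CORE_RELOCATABLE;

/* Helpers that expect the original type need a cast from the local mirror. */
static struct bpf_perf_event_value *
as_uapi_value(struct bpf_perf_event_value___local *val)
{
	return (struct bpf_perf_event_value *)val;
}
```

Because the mirror is complete, `sizeof(struct bpf_perf_event_value___local)` can be used for the map's value size even when the kernel type is only forward-declared.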

bpftool: Use a local copy of BPF_LINK_TYPE_PERF_EVENT in pid_iter.bpf.c · 44ba7b30

Quentin Monnet authored Jul 07, 2023

In order to allow the BPF program in bpftool's pid_iter.bpf.c to compile
correctly on hosts where vmlinux.h does not define
BPF_LINK_TYPE_PERF_EVENT (running kernel versions lower than 5.15, for
example), define and use a local copy of the enum value. This requires
LLVM 12 or newer to build the BPF program.

Fixes: cbdaf71f ("bpftool: Add bpf_cookie to link output")
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230707095425.168126-4-quentin@isovalent.com

bpftool: Define a local bpf_perf_link to fix accessing its fields · 67a43462

Alexander Lobakin authored Jul 07, 2023

When building bpftool with !CONFIG_PERF_EVENTS:

skeleton/pid_iter.bpf.c:47:14: error: incomplete definition of type 'struct bpf_perf_link'
        perf_link = container_of(link, struct bpf_perf_link, link);
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tools/bpf/bpftool/bootstrap/libbpf/include/bpf/bpf_helpers.h:74:22: note: expanded from macro 'container_of'
                ((type *)(__mptr - offsetof(type, member)));    \
                                   ^~~~~~~~~~~~~~~~~~~~~~
tools/bpf/bpftool/bootstrap/libbpf/include/bpf/bpf_helpers.h:68:60: note: expanded from macro 'offsetof'
 #define offsetof(TYPE, MEMBER)  ((unsigned long)&((TYPE *)0)->MEMBER)
                                                  ~~~~~~~~~~~^
skeleton/pid_iter.bpf.c:44:9: note: forward declaration of 'struct bpf_perf_link'
        struct bpf_perf_link *perf_link;
               ^

&bpf_perf_link is being defined and used only under the ifdef.
Define struct bpf_perf_link___local with the `preserve_access_index`
attribute inside the pid_iter BPF prog to allow compiling on any
configs. CO-RE will substitute it with the real struct bpf_perf_link
accesses later on.
container_of() uses offsetof(), which does the necessary CO-RE
relocation if the field is specified with `preserve_access_index` - as
is the case for struct bpf_perf_link___local.

Fixes: cbdaf71f ("bpftool: Add bpf_cookie to link output")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230707095425.168126-3-quentin@isovalent.com
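
The container_of()/offsetof() relationship mentioned in this commit can be illustrated in user-space C. This is only a sketch of the pointer arithmetic: the `*_model` structures and their fields are hypothetical stand-ins, and no CO-RE relocation takes place outside a BPF build.

```c
#include <stddef.h>

/* Recover a pointer to the enclosing structure from a pointer to a member. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct bpf_link. */
struct bpf_link_model {
	int id;
};

/* Hypothetical stand-in for struct bpf_perf_link; the layout is illustrative. */
struct bpf_perf_link_model {
	unsigned long cookie;
	struct bpf_link_model link;
};

static struct bpf_perf_link_model *to_perf_link(struct bpf_link_model *link)
{
	/* Subtract the offset of 'link' to get back to the outer structure. */
	return container_of(link, struct bpf_perf_link_model, link);
}
```

In the BPF program the same offsetof() computation is performed on a structure carrying `preserve_access_index`, which is what lets the offset be adjusted to the running kernel's layout.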

bpftool: use a local copy of perf_event to fix accessing ::bpf_cookie · 4cbeeb0d

Alexander Lobakin authored Jul 07, 2023

When CONFIG_PERF_EVENTS is not set, struct perf_event remains empty.
However, the structure is being used by bpftool indirectly via BTF.
This leads to:

skeleton/pid_iter.bpf.c:49:30: error: no member named 'bpf_cookie' in 'struct perf_event'
        return BPF_CORE_READ(event, bpf_cookie);
               ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~

...

skeleton/pid_iter.bpf.c:49:9: error: returning 'void' from a function with incompatible result type '__u64' (aka 'unsigned long long')
        return BPF_CORE_READ(event, bpf_cookie);
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tools and samples can't use any CONFIG_ definitions, so the fields
used there should always be present.
Define struct perf_event___local with the `preserve_access_index`
attribute inside the pid_iter BPF prog to allow compiling on any
configs. CO-RE will substitute it with the real struct perf_event
accesses later on.

Fixes: cbdaf71f ("bpftool: Add bpf_cookie to link output")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230707095425.168126-2-quentin@isovalent.com

09 Jul, 2023 1 commit

libbpf: only reset sec_def handler when necessary · c628747c

Andrii Nakryiko authored Jul 07, 2023

Don't reset recorded sec_def handler unconditionally on
bpf_program__set_type(). There are two situations where this is wrong.

First, if the program type didn't actually change. In that case original
SEC handler should work just fine.

Second, catch-all custom SEC handler is supposed to work with any BPF
program type and SEC() annotation, so it also doesn't make sense to
reset that.

This patch fixes both issues. This was reported recently in the context
of breaking perf tool, which uses custom catch-all handler for fancy BPF
prologue generation logic. This patch should fix the issue.

  [0] https://lore.kernel.org/linux-perf-users/ab865e6d-06c5-078e-e404-7f90686db50d@amd.com/

Fixes: d6e6286a ("libbpf: disassociate section handler on explicit bpf_program__set_type() call")
Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230707231156.1711948-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
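
The two conditions described in this commit can be modelled with a small stand-alone sketch. All types and names below (`struct prog`, `prog_set_type()`, `custom_fallback_def`, and so on) are simplified, hypothetical stand-ins and not the libbpf definitions.

```c
#include <stddef.h>

enum prog_type { PROG_TYPE_UNSPEC, PROG_TYPE_KPROBE, PROG_TYPE_XDP };

/* Simplified stand-in for a SEC() handler definition. */
struct sec_def {
	const char *sec;
};

/* Stand-in for a catch-all custom handler that accepts any program type. */
static const struct sec_def custom_fallback_def = { "custom-fallback" };
/* Stand-in for an ordinary, type-specific handler. */
static const struct sec_def kprobe_def = { "kprobe" };

struct prog {
	enum prog_type type;
	const struct sec_def *sec_def;
};

static void prog_set_type(struct prog *p, enum prog_type type)
{
	/* First case: the type did not change, so keep the handler. */
	if (p->type == type)
		return;

	p->type = type;

	/* Second case: a catch-all handler works with any type, so keep it. */
	if (p->sec_def != &custom_fallback_def)
		p->sec_def = NULL;
}
```

Only a real type change on a program with a type-specific handler ends up clearing the recorded handler.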

07 Jul, 2023 1 commit

selftests/bpf: Correct two typos · 856fe03d

Lu Hongfei authored Jul 07, 2023

When wrapping code, terminate statements with ';' rather than ',', which is
more in line with the coding habits of most engineers.
Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230707081253.34638-1-luhongfei@vivo.com

06 Jul, 2023 10 commits

libbpf: Use available_filter_functions_addrs with multi-kprobes · 56baeeba

Jackie Liu authored Jul 05, 2023

Now that kernel provides a new available_filter_functions_addrs file
which can help us avoid the need to cross-validate
available_filter_functions and kallsyms, we can improve efficiency of
multi-attach kprobes. For example, on my device, the start time of the
sample program [1]:

$ sudo ./funccount "tcp_*"

before   after
1.2s     1.0s

  [1]: https://github.com/JackieLiu1/ketones/tree/master/src/funccount
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230705091209.3803873-2-liu.yun@linux.dev
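
Reading one entry of available_filter_functions_addrs could be sketched as follows. The line format assumed here (a hexadecimal address, whitespace, a symbol name, and an optional trailing module tag) is an assumption of this sketch, and `parse_addr_line()` and `SYM_NAME_MAX` are hypothetical names rather than libbpf code.

```c
#include <stdio.h>

#define SYM_NAME_MAX 128

/*
 * Parse one "<hex address> <symbol>" line. The name buffer must hold at
 * least SYM_NAME_MAX bytes. Returns 0 on success and -1 if the line does
 * not match the assumed format.
 */
static int parse_addr_line(const char *line, unsigned long long *addr,
			   char *name)
{
	/* %127s stops at whitespace, so any trailing module tag is ignored. */
	if (sscanf(line, "%llx %127s", addr, name) != 2)
		return -1;
	return 0;
}
```

Since each line already pairs a traceable function with its address, a single pass over the file is enough and no second lookup in kallsyms is needed.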

libbpf: Cross-join available_filter_functions and kallsyms for multi-kprobes · 8a3fe76f

Jackie Liu authored Jul 05, 2023

When using regular expression matching with "kprobe multi", it scans all
the functions under "/proc/kallsyms" that can be matched. However, not all
of them can be traced by kprobe.multi. If any one of the functions fails
to be traced, it will result in the failure of all functions. The best
approach is to filter out the functions that cannot be traced to ensure
proper tracking of the functions.

Closes: https://lore.kernel.org/oe-kbuild-all/202307030355.TdXOHklM-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Jiri Olsa <jolsa@kernel.org>
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230705091209.3803873-1-liu.yun@linux.dev

selftests/bpf: Bump and validate MAX_SYMS · e76a0143

Björn Töpel authored Jul 06, 2023

BPF tests that load /proc/kallsyms, e.g. bpf_cookie, will perform a
buffer overrun if the number of syms on the system is larger than
MAX_SYMS.

Bump the MAX_SYMS to 400000, and add a runtime check that bails out if
the maximum is reached.
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230706142228.1128452-1-bjorn@kernel.org
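
The runtime check described in this commit amounts to testing the symbol count before every store. The sketch below is a simplified model: the table, `add_sym()`, the tiny `MAX_SYMS` value and the choice of `-EFBIG` as the error code are assumptions for illustration, not the selftest code.

```c
#include <errno.h>
#include <stddef.h>

/* Deliberately tiny for the sketch; the commit bumps the real limit to 400000. */
#define MAX_SYMS 4

struct ksym {
	unsigned long addr;
	const char *name;
};

static struct ksym syms[MAX_SYMS];
static size_t sym_cnt;

static int add_sym(unsigned long addr, const char *name)
{
	/* Bail out instead of writing past the end of the table. */
	if (sym_cnt >= MAX_SYMS)
		return -EFBIG;

	syms[sym_cnt].addr = addr;
	syms[sym_cnt].name = name;
	sym_cnt++;
	return 0;
}
```

A caller that loads /proc/kallsyms can then stop and report the failure once the table is full, rather than overrunning the buffer.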

Merge branch 'bpf: add percpu stats for bpf_map' · b625030c

Alexei Starovoitov authored Jul 06, 2023

Anton Protopopov says:

====================
This series adds a mechanism for maps to populate per-cpu counters on
insertions/deletions. The sum of these counters can be accessed by a new kfunc
from map iterator and tracing programs.

The following patches are present in the series:

  * Patch 1 adds a generic per-cpu counter to struct bpf_map
  * Patch 2 adds a new kfunc to access the sum of per-cpu counters
  * Patch 3 utilizes this mechanism for hash-based maps
  * Patch 4 extends the preloaded map iterator to dump the sum
  * Patch 5 adds a self-test for the change

The reason for adding this functionality in our case (Cilium) is to get signals
about how full some heavy-used maps are and what the actual dynamic profile of
map capacity is. In the case of LRU maps it is impossible to get this
information any other way. The original presentation can be found here [1].

  [1] https://lpc.events/event/16/contributions/1368/

v4 -> v5:
* don't pass useless empty opts when creating a link, pass NULL (Hou)
* add a debug message (Hou)
* make code more readable (Alexei)
* remove the selftest which only checked that elem_count != NULL

v3 -> v4:
* fix selftests:
  * added test code for batch map operations
  * added a test for BPF_MAP_TYPE_HASH_OF_MAPS (Hou)
  * added tests for BPF_MAP_TYPE_LRU* with BPF_F_NO_COMMON_LRU (Hou)
  * map_info was called multiple times unnecessarily (Hou)
  * small fixes + some memory leaks (Hou)
* fixed wrong error path for freeing a non-prealloc map (Hou)
* fixed counters for batch delete operations (Hou)

v2 -> v3:
- split commits to better represent update logic (Alexei)
- remove filter from kfunc to allow all tracing programs (Alexei)
- extend selftests (Alexei)

v1 -> v2:
- make the counters generic part of struct bpf_map (Alexei)
- don't use map_info and /proc/self/fdinfo in favor of a kfunc (Alexei)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: test map percpu stats · 6c1b8cb6

Anton Protopopov authored Jul 06, 2023

Add a new map test, map_percpu_stats.c, which is checking the correctness of
map's percpu elements counters.  For supported maps the test upserts a number
of elements, checks the correctness of the counters, then deletes all the
elements and checks again that the counters sum drops down to zero.

The following map types are tested:

    * BPF_MAP_TYPE_HASH, BPF_F_NO_PREALLOC
    * BPF_MAP_TYPE_PERCPU_HASH, BPF_F_NO_PREALLOC
    * BPF_MAP_TYPE_HASH,
    * BPF_MAP_TYPE_PERCPU_HASH,
    * BPF_MAP_TYPE_LRU_HASH
    * BPF_MAP_TYPE_LRU_PERCPU_HASH
    * BPF_MAP_TYPE_LRU_HASH, BPF_F_NO_COMMON_LRU
    * BPF_MAP_TYPE_LRU_PERCPU_HASH, BPF_F_NO_COMMON_LRU
    * BPF_MAP_TYPE_HASH_OF_MAPS
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230706133932.45883-6-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: make preloaded map iterators to display map elements count · 515ee52b

Anton Protopopov authored Jul 06, 2023

Add another column to the /sys/fs/bpf/maps.debug iterator to display
cur_entries, the current number of entries in the map as is returned
by the bpf_map_sum_elem_count kfunc. Also fix formatting.

Example:

    # cat /sys/fs/bpf/maps.debug
      id name             max_entries  cur_entries
       2 iterator.rodata            1            0
     125 cilium_auth_map       524288          666
     126 cilium_runtime_          256            0
     127 cilium_signals            32            0
     128 cilium_node_map        16384         1344
     129 cilium_events             32            0
     ...
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230706133932.45883-5-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: populate the per-cpu insertions/deletions counters for hashmaps · 9bc421b6

Anton Protopopov authored Jul 06, 2023

Initialize and utilize the per-cpu insertions/deletions counters for hash-based
maps. Non-trivial changes only apply to the preallocated maps for which the
{inc,dec}_elem_count functions are not called, as there's no need to count
elements to sustain proper map operations.

To increase/decrease percpu counters for preallocated maps we add raw calls to
the bpf_map_{inc,dec}_elem_count functions so that the impact is minimal. For
dynamically allocated maps we add corresponding calls to the existing
{inc,dec}_elem_count functions.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230706133932.45883-4-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: add a new kfunc to return current bpf_map elements count · 803370d3

Anton Protopopov authored Jul 06, 2023

A bpf_map_sum_elem_count kfunc was added to simplify getting the sum of the map
per-cpu element counters. If a map doesn't implement the counter, then the
function will always return 0.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230706133932.45883-3-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
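
The behaviour described for bpf_map_sum_elem_count can be modelled in user space as a sum over one counter per CPU. Everything in the sketch (`struct map_model`, the helper names, the fixed `NR_CPUS_MODEL`) is a hypothetical stand-in; the kernel uses real per-cpu storage instead of a plain array.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS_MODEL 4

/* Hypothetical model of a map with optional per-CPU element counters. */
struct map_model {
	int64_t *elem_count; /* NULL if the map type keeps no counters */
};

static void map_inc_elem_count(struct map_model *m, int cpu)
{
	m->elem_count[cpu]++;
}

static void map_dec_elem_count(struct map_model *m, int cpu)
{
	m->elem_count[cpu]--;
}

static int64_t map_sum_elem_count(const struct map_model *m)
{
	int64_t sum = 0;
	int cpu;

	/* Maps that do not implement the counter always report 0. */
	if (!m || !m->elem_count)
		return 0;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		sum += m->elem_count[cpu];
	return sum;
}
```

An individual counter can go negative when an element is inserted on one CPU and deleted on another, which is why only the sum is meaningful as an element count.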

bpf: add percpu stats for bpf_map elements insertions/deletions · 25954730

Anton Protopopov authored Jul 06, 2023

Add generic percpu stats for bpf_map element insertions/deletions in order
to keep track of both the current (approximate) number of elements in a map
and per-cpu statistics on update/delete operations.

To expose these stats a particular map implementation should initialize the
counter and adjust it as needed using the 'bpf_map_*_elem_count' helpers
provided by this commit.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230706133932.45883-2-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add benchmark for bpf memory allocator · fd283ab1

Hou Tao authored Jul 04, 2023

The benchmark could be used to compare the performance of hash map
operations and the memory usage between different flavors of bpf memory
allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It also
could be used to check the performance improvement or the memory saving
provided by optimization.

The benchmark creates a non-preallocated hash map which uses bpf memory
allocator and shows the operation performance and the memory usage of
the hash map under different use cases:
(1) overwrite
Each CPU overwrites nonoverlapping part of hash map. When each CPU
completes overwriting of 64 elements in hash map, it increases the
op_count.
(2) batch_add_batch_del
Each CPU adds then deletes nonoverlapping part of hash map in batch.
When each CPU adds and deletes 64 elements in hash map, it increases
the op_count twice.
(3) add_del_on_diff_cpu
Each two-CPUs pair adds and deletes nonoverlapping part of map
cooperatively. When each CPU adds or deletes 64 elements in hash map,
it will increase the op_count.

The following are the benchmark results when comparing different
flavors of bpf memory allocator. These tests are conducted on a KVM guest
with 8 CPUs and 16 GB memory. The command line below is used to do all
the following benchmarks:

  ./bench htab-mem --use-case $name ${OPTS} -w3 -d10 -a -p8

These results show that preallocated hash map has both better performance
and smaller memory footprint.

(1) non-preallocated + no bpf memory allocator (v6.0.19)
use kmalloc() + call_rcu

overwrite            per-prod-op: 11.24 ± 0.07k/s, avg mem: 82.64 ± 26.32MiB, peak mem: 119.18MiB
batch_add_batch_del  per-prod-op: 18.45 ± 0.10k/s, avg mem: 50.47 ± 14.51MiB, peak mem: 94.96MiB
add_del_on_diff_cpu  per-prod-op: 14.50 ± 0.03k/s, avg mem: 4.64 ± 0.73MiB, peak mem: 7.20MiB

(2) preallocated
OPTS=--preallocated

overwrite            per-prod-op: 191.42 ± 0.09k/s, avg mem: 1.24 ± 0.00MiB, peak mem: 1.49MiB
batch_add_batch_del  per-prod-op: 221.83 ± 0.17k/s, avg mem: 1.23 ± 0.00MiB, peak mem: 1.49MiB
add_del_on_diff_cpu  per-prod-op: 39.66 ± 0.31k/s, avg mem: 1.47 ± 0.13MiB, peak mem: 1.75MiB

(3) normal bpf memory allocator

overwrite            per-prod-op: 126.59 ± 0.02k/s, avg mem: 2.26 ± 0.00MiB, peak mem: 2.74MiB
batch_add_batch_del  per-prod-op: 83.37 ± 0.20k/s, avg mem: 2.14 ± 0.17MiB, peak mem: 2.74MiB
add_del_on_diff_cpu  per-prod-op: 21.25 ± 0.24k/s, avg mem: 17.50 ± 3.32MiB, peak mem: 28.87MiB
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230704025039.938914-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

05 Jul, 2023 3 commits

selftests/bpf: Honor $(O) when figuring out paths · 21be9e47

Björn Töpel authored Jul 05, 2023

When building the kselftests out-of-tree, e.g. ...

  | make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- \
  |   O=/tmp/kselftest headers
  | make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- \
  |  O=/tmp/kselftest HOSTCC=gcc FORMAT= \
  |  SKIP_TARGETS="arm64 ia64 powerpc sparc64 x86 sgx" \
  |  -C tools/testing/selftests gen_tar

... the kselftest build would not pick up the correct GENDIR path, and
therefore would not include autoconf.h.

Correct that by taking $(O) into consideration when figuring out the
GENDIR path.
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230705113926.751791-3-bjorn@kernel.org

selftests/bpf: Add F_NEEDS_EFFICIENT_UNALIGNED_ACCESS to some tests · ce1f289f

Björn Töpel authored Jul 05, 2023

Some verifier tests were missing F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
which made the test fail. Add the flag where needed.
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230705113926.751791-2-bjorn@kernel.org

bpf: Remove unnecessary ring buffer size check · cf6eeb8f

Hou Tao authored Jul 04, 2023

The theoretical maximum size of ring buffer is about 64GB, but now the
size of ring buffer is specified by max_entries in bpf_attr and its
maximum value is (4GB - 1), so overflow is not possible.

So just remove the unnecessary size check in ringbuf_map_alloc() but
keep the comments for possible extension in future.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://lore.kernel.org/bpf/9c636a63-1f3d-442d-9223-96c2dccb9469@moroto.mountain
Link: https://lore.kernel.org/bpf/20230704074014.216616-1-houtao@huaweicloud.com
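
The arithmetic behind this commit can be checked with a few lines of C. The 64 GiB constant below is simply the approximate figure quoted in the commit message, and `ringbuf_size_exceeds_limit()` is a hypothetical name used only for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Approximate theoretical limit quoted in the commit message: 64 GiB. */
#define RINGBUF_THEORETICAL_MAX (64ULL << 30)

/* max_entries is a 32-bit field, so it can never describe more than 4GB - 1. */
static bool ringbuf_size_exceeds_limit(uint32_t max_entries)
{
	uint64_t size = max_entries;

	return size > RINGBUF_THEORETICAL_MAX;
}
```

Since the largest 32-bit value is far below the limit, the function returns false for every possible input, which is why the check could be dropped.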

30 Jun, 2023 9 commits

Merge branch 'libbpf: add netfilter link attach helper' · c20f9cef

Andrii Nakryiko authored Jun 30, 2023

Florian Westphal says:

====================
v4: address comment from Daniel Xu:
  - use human-readable test names in 2/2

v3: address comments from Andrii:
  - prune verbose error message in 1/2
  - use bpf_link_create internally in 1/2
  - use subtests in patch 2/2

When initial netfilter bpf program type support got added one
suggestion was to extend libbpf with a helper to ease attachment
of nf programs to the hook locations.

Add such a helper and a demo test case that attaches a dummy
program to various combinations.

I tested that the selftest fails when changing the expected
outcome (i.e., set 'success' when it should fail and v.v.).
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: Add bpf_program__attach_netfilter helper test · a94098d4

Florian Westphal authored Jun 28, 2023

Call bpf_program__attach_netfilter() with different
protocol/hook/priority combinations.

Test fails if supposedly-illegal attachments work
(e.g., bogus protocol family, illegal priority and so on) or if a
should-work attachment fails.  Expected output:

 ./test_progs -t netfilter_link_attach
 #145/1   netfilter_link_attach/allzero:OK
 #145/2   netfilter_link_attach/invalid-pf:OK
 #145/3   netfilter_link_attach/invalid-hooknum:OK
 #145/4   netfilter_link_attach/invalid-priority-min:OK
 #145/5   netfilter_link_attach/invalid-priority-max:OK
 #145/6   netfilter_link_attach/invalid-flags:OK
 #145/7   netfilter_link_attach/invalid-inet-not-supported:OK
 #145/8   netfilter_link_attach/attach ipv4:OK
 #145/9   netfilter_link_attach/attach ipv6:OK
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/bpf/20230628152738.22765-3-fw@strlen.de

libbpf: Add netfilter link attach helper · 52364abb

Florian Westphal authored Jun 28, 2023

Add new api function: bpf_program__attach_netfilter.

It takes a bpf program (netfilter type) and a pointer to an option struct
that contains the desired attachment (protocol family, priority, hook
location, ...).

It returns a pointer to a 'bpf_link' structure or NULL on error.

Next patch adds new netfilter_basic test that uses this function to
attach a program to a few pf/hook/priority combinations.

v2: change name and use bpf_link_create.
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/bpf/CAEf4BzZrmUv27AJp0dDxBDMY_B8e55-wLs8DUKK69vCWsCG_pQ@mail.gmail.com/
Link: https://lore.kernel.org/bpf/CAEf4BzZ69YgrQW7DHCJUT_X+GqMq_ZQQPBwopaJJVGFD5=d5Vg@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20230628152738.22765-2-fw@strlen.de

libbpf: Skip modules BTF loading when CAP_SYS_ADMIN is missing · 2d2c9516

Andrea Terzolo authored Jun 26, 2023

If during CO-RE relocations libbpf is not able to find the target type
in the running kernel BTF, it searches for it in modules' BTF.
The downside of this approach is that loading modules' BTF requires
CAP_SYS_ADMIN and this prevents BPF applications from running with more
granular capabilities (e.g. CAP_BPF) when they don't need to search
for types in modules' BTF.

This patch skips the modules' BTF loading phase by default when
CAP_SYS_ADMIN is missing.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Co-developed-by: Federico Di Pierro <nierro92@gmail.com>
Signed-off-by: Federico Di Pierro <nierro92@gmail.com>
Signed-off-by: Andrea Terzolo <andreaterzolo3@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/CAGQdkDvYU_e=_NX+6DRkL_-TeH3p+QtsdZwHkmH0w3Fuzw0C4w@mail.gmail.com
Link: https://lore.kernel.org/bpf/20230626093614.21270-1-andreaterzolo3@gmail.com

selftests/bpf: Verify that the cgroup_skb filters receive expected packets. · 539c7e67

Kui-Feng Lee authored Jun 23, 2023

This test case includes four scenarios:

1. Connect to the server from outside the cgroup and close the connection
   from outside the cgroup.
2. Connect to the server from outside the cgroup and close the connection
   from inside the cgroup.
3. Connect to the server from inside the cgroup and close the connection
   from outside the cgroup.
4. Connect to the server from inside the cgroup and close the connection
   from inside the cgroup.

The test case is to verify that cgroup_skb/{egress, ingress} filters
receive expected packets including SYN, SYN/ACK, ACK, FIN, and FIN/ACK.
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230624014600.576756-3-kuifeng@meta.com

bpf, net: Check skb ownership against full socket. · 223f5f79

Kui-Feng Lee authored Jun 23, 2023

Check skb ownership of an skb against full sockets instead of request_sock.

The filters were called only if an skb is owned by the sock that the skb is
sent out through. In other words, skb->sk should point to the sock that
it is sending through its egress. However, the filters would miss SYN/ACK
skbs that are owned by a request_sock but sent through the listener
sock, that is, the socket listening for incoming connections.

However, the listener socket is also the full socket of the request socket.
We should use the full socket as the owner socket of an skb instead.
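
The difference between the old and the new ownership check can be modelled in user space. The types and helpers below (`struct sock_model`, `to_full_sk()`, `owns_skb_old()`, `owns_skb_new()`) are hypothetical simplifications and not the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

enum sk_kind { SK_FULL, SK_REQUEST };

/* Hypothetical model: a request sock points back at its listener. */
struct sock_model {
	enum sk_kind kind;
	struct sock_model *listener; /* set only for request socks */
};

/* Map a request sock to its full (listener) socket; full socks map to themselves. */
static struct sock_model *to_full_sk(struct sock_model *sk)
{
	if (sk && sk->kind == SK_REQUEST)
		return sk->listener;
	return sk;
}

/* Old check: compare the sending sock with skb->sk directly. */
static bool owns_skb_old(struct sock_model *sk, struct sock_model *skb_sk)
{
	return sk == skb_sk;
}

/* New check: compare the full sockets on both sides. */
static bool owns_skb_new(struct sock_model *sk, struct sock_model *skb_sk)
{
	return to_full_sk(sk) == to_full_sk(skb_sk);
}
```

For a SYN/ACK, the skb is owned by the request sock while it is sent through the listener, so the old comparison fails and the new one succeeds; an unrelated socket still fails both checks.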

What is the ownership check for?
================================

BPF_CGROUP_RUN_PROG_INET_EGRESS() checked sk == skb->sk to ensure the
ownership of an skb. Alexei referred to a mailing list conversation [0]
that took place a few years ago. In that conversation, Daniel Borkmann
stated that:

    Wouldn't that mean however, when you go through stacked devices that
    you'd run the same eBPF cgroup program for skb->sk multiple times?

According to what Daniel said, the ownership check mentioned earlier
presumably prevents multiple calls of egress filters caused by an skb.

A test that reproduces this scenario shows that the BPF cgroup egress
programs can be called multiple times for one skb if this ownership
check is not there. So, we cannot just remove this check.

Test Stacked Devices
====================

We use L2TP to build an environment of stacked devices. L2TP (Layer 2
Tunneling Protocol) is a tunneling protocol used to support virtual private
networks (VPNs). It relays encapsulated packets, for example in UDP, to its
peer by using a socket.

Using L2TP, packets are first sent through the IP stack and should then
arrive at an L2TP device. The device will expand its skb header to
encapsulate the packet. The skb will be sent back to the IP stack using
the socket that was made for the L2TP session. After that, the routing
process will occur once more, but this time for a new destination.

We changed tools/testing/selftests/net/l2tp.sh to set up a test environment
using L2TP. The run_ping() function in l2tp.sh is where the main change
occurred.

    run_ping()
    {
        local desc="$1"

        sleep 10
        run_cmd host-1 ${ping6} -s 227 -c 4 -i 10 -I fc00:101::1 \
            fc00:101::2
        log_test $? 0 "IPv6 route through L2TP tunnel ${desc}"
        sleep 10
    }

The test will use L2TP devices to send PING messages. These messages will
have a message size of 227 bytes as a special label to distinguish them.
This is not an ideal solution, but works.

During the execution of the test script, bpftrace was attached to
ip6_finish_output() and l2tp_xmit_skb():

    bpftrace -e '
      kfunc:ip6_finish_output {
        time("%H:%M:%S: ");
        printf("ip6_finish_output skb=%p skb->len=%d cgroup=%p sk=%p skb->sk=%p\n",
               args->skb, args->skb->len,
               args->sk->sk_cgrp_data.cgroup, args->sk, args->skb->sk); }
      kfunc:l2tp_xmit_skb {
        time("%H:%M:%S: ");
        printf("l2tp_xmit_skb skb=%p sk=%p\n", args->skb,
               args->session->tunnel->sock); }'

The following is part of the output messages printed by bpftrace:

    16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=275
              cgroup=0xffff88810741f800 sk=0xffff888105f3b900
              skb->sk=0xffff888105f3b900

    16:35:20: l2tp_xmit_skb skb=0xffff888103d8e600 sk=0xffff888103dd6300

    16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337
              cgroup=0xffff88810741f800 sk=0xffff888103dd6300
              skb->sk=0xffff888105f3b900

    16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337
              cgroup=(nil) sk=(nil) skb->sk=(nil)

    16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=275
              cgroup=0xffffffff837741d0 sk=0xffff888101fe0000
              skb->sk=0xffff888101fe0000

    16:35:20: l2tp_xmit_skb skb=0xffff888103d8e000 sk=0xffff888103483180

    16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337
              cgroup=0xffff88810741f800 sk=0xffff888103483180
              skb->sk=0xffff888101fe0000

    16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337
              cgroup=(nil) sk=(nil) skb->sk=(nil)

The first four entries describe a PING message that was sent using the ping
command, whereas the following four entries describe the response received.
The output shows that multiple sockets, including the socket used by the
L2TP session, are used to send one skb.

Based on this information, it seems that the ownership check is designed to
avoid multiple calls of egress filters caused by a single skb.
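
The effect of the check on this trace can be tallied with a short script.
This is only a sketch: TRACE copies the (skb, sk, skb->sk) values of the
ip6_finish_output lines above, egress_filter_runs is a hypothetical helper,
and it assumes that no filter runs when sk is (nil).

```python
from collections import Counter

# (skb, sk, skb->sk) for each ip6_finish_output line in the trace;
# None stands for (nil).
TRACE = [
    (0xffff888103d8e600, 0xffff888105f3b900, 0xffff888105f3b900),
    (0xffff888103d8e600, 0xffff888103dd6300, 0xffff888105f3b900),
    (0xffff888103d8e600, None, None),
    (0xffff888103d8e000, 0xffff888101fe0000, 0xffff888101fe0000),
    (0xffff888103d8e000, 0xffff888103483180, 0xffff888101fe0000),
    (0xffff888103d8e000, None, None),
]


def egress_filter_runs(trace, ownership_check):
    """Count how often an egress filter would run per skb (toy model)."""
    runs = Counter()
    for skb, sk, skb_sk in trace:
        if sk is None:
            continue  # assumption: no socket means no cgroup program to run
        if ownership_check and sk != skb_sk:
            continue  # sk does not own the skb, skip the filter
        runs[skb] += 1
    return dict(runs)


with_check = egress_filter_runs(TRACE, ownership_check=True)
without_check = egress_filter_runs(TRACE, ownership_check=False)

for skb in with_check:
    print(hex(skb), "with check:", with_check[skb],
          "without check:", without_check[skb])
```

Under these assumptions each skb triggers the filter once with the check and
twice without it, which matches the conclusion drawn from the trace.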

  [0] https://lore.kernel.org/all/58193E9D.7040201@iogearbox.net/

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230624014600.576756-2-kuifeng@meta.com

223f5f79

selftests/bpf: Add test to exercise typedef walking · 2597a25c

Stanislav Fomichev authored Jun 26, 2023

Add new bpf_fentry_test_sinfo with skb_shared_info argument and try to
access frags.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230626212522.2414485-2-sdf@google.com

2597a25c

bpf: Resolve modifiers when walking structs · 819d4342

Stanislav Fomichev authored Jun 26, 2023

It is impossible to use skb_frag_t in the tracing program. Resolve typedefs
when walking structs.
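
What "resolving" a type chain means can be sketched with a toy type table.
The Python below is an illustration only: the type ids, the names
frag_storage and frag_t, and the helper skip_modifiers are made up for this
example and do not mirror the kernel's BTF data or functions.

```python
# Kinds that merely wrap another type and must be looked through.
MODIFIERS = {"typedef", "const", "volatile"}

# type id -> (kind, name, referenced type id or None); all entries are made up.
TYPES = {
    1: ("struct", "frag_storage", None),
    2: ("typedef", "frag_t", 1),
    3: ("const", "", 2),
}


def skip_modifiers(types, type_id):
    """Follow typedef/const/volatile links until a concrete type is reached."""
    kind, name, ref = types[type_id]
    while kind in MODIFIERS:
        type_id = ref
        kind, name, ref = types[type_id]
    return type_id, kind, name


# A member declared as "const frag_t" resolves to the underlying struct,
# so a struct walk can continue into its members.
print(skip_modifiers(TYPES, 3))
```
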
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230626212522.2414485-1-sdf@google.com

819d4342

selftests/bpf: Fix bpf_nf failure upon test rerun · 17e8e5d6

Daniel Borkmann authored Jun 26, 2023

Alexei reported:

  After fast forwarding bpf-next today bpf_nf test started to fail when
  run twice:

  $ ./test_progs -t bpf_nf
  #17      bpf_nf:OK
  Summary: 1/10 PASSED, 0 SKIPPED, 0 FAILED

  $ ./test_progs -t bpf_nf
  All error logs:
  test_bpf_nf_ct:PASS:test_bpf_nf__open_and_load 0 nsec
  test_bpf_nf_ct:PASS:iptables-legacy -t raw -A PREROUTING -j CONNMARK
  --set-mark 42/0 0 nsec
  (network_helpers.c:102: errno: Address already in use) Failed to bind socket
  test_bpf_nf_ct:FAIL:start_server unexpected start_server: actual -1 < expected 0
  #17/1    bpf_nf/xdp-ct:FAIL
  test_bpf_nf_ct:PASS:test_bpf_nf__open_and_load 0 nsec
  test_bpf_nf_ct:PASS:iptables-legacy -t raw -A PREROUTING -j CONNMARK
  --set-mark 42/0 0 nsec
  (network_helpers.c:102: errno: Address already in use) Failed to bind socket
  test_bpf_nf_ct:FAIL:start_server unexpected start_server: actual -1 < expected 0
  #17/2    bpf_nf/tc-bpf-ct:FAIL
  #17      bpf_nf:FAIL
  Summary: 0/8 PASSED, 0 SKIPPED, 1 FAILED

I was able to locally reproduce as well. Rearrange the connection teardown
so that the client closes its connection first so that we don't need to
linger in TCP time-wait.
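
Why the close order matters can be tried outside the selftest. The Python
sketch below is not the selftest code: can_rebind_after_teardown is a
hypothetical helper, it uses loopback only, and it deliberately does not set
SO_REUSEADDR. The side that closes first performs the active close and is the
one that ends up in TIME_WAIT.

```python
import socket
import time


def can_rebind_after_teardown(client_closes_first):
    """Tear down one loopback connection, then try to bind the server port again."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # let the kernel pick a free port
    port = listener.getsockname()[1]
    listener.listen(1)

    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = listener.accept()

    first, second = (client, conn) if client_closes_first else (conn, client)
    first.close()        # active close: this side ends up in TIME_WAIT
    second.recv(1)       # returns b"" once the peer's FIN has arrived
    second.close()       # passive close
    listener.close()
    time.sleep(0.1)      # give the loopback FIN/ACK exchange time to finish

    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind(("127.0.0.1", port))  # no SO_REUSEADDR on purpose
        return True
    except OSError:
        return False
    finally:
        probe.close()


print("client closes first, rebind ok:", can_rebind_after_teardown(True))
print("server closes first, rebind ok:", can_rebind_after_teardown(False))
```

On Linux one would expect the first case to allow an immediate rebind, since
the TIME_WAIT state then sits on the client's ephemeral port rather than on
the server port; the second case typically fails with "Address already in
use" until TIME_WAIT expires.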

Fixes: e81fbd4c ("selftests/bpf: Add existing connection bpf_*_ct_lookup() test")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/CAADnVQ+0dnDq_v_vH1EfkacbfGnHANaon7zsw10pMb-D9FS0Pw@mail.gmail.com
Link: https://lore.kernel.org/bpf/20230626131942.5100-1-daniel@iogearbox.net

17e8e5d6

29 Jun, 2023 3 commits

bpf, docs: Fix definition of BPF_NEG operation · 85b0c6d4

Dave Thaler authored Jun 27, 2023

Instruction is an arithmetic negative, not a bitwise inverse.
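
The two operations are easy to confuse because they differ by exactly one in
two's complement. A small Python sketch on 64-bit values (neg64 and not64 are
names chosen for this example):

```python
MASK64 = (1 << 64) - 1


def neg64(x):
    # Arithmetic negation, truncated to 64 bits (two's complement).
    return (-x) & MASK64


def not64(x):
    # Bitwise inverse, truncated to 64 bits.
    return ~x & MASK64


x = 1
print(hex(neg64(x)))  # arithmetic negative of 1
print(hex(not64(x)))  # bitwise inverse of 1
```

For any x, neg64(x) equals not64(x) + 1 modulo 2^64.
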
Signed-off-by: Dave Thaler <dthaler@microsoft.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230627213912.951-1-dthaler1968@googlemail.com

85b0c6d4

bpf: Replace deprecated -target with --target= for Clang · bbaf1ff0

Fangrui Song authored Jun 24, 2023

The -target option has been deprecated since clang 3.4 in 2013. Therefore, use
the preferred --target=bpf form instead. This also matches how we use --target=
in scripts/Makefile.clang.
Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://github.com/llvm/llvm-project/commit/274b6f0c87a6a1798de0a68135afc7f95def6277
Link: https://lore.kernel.org/bpf/20230624001856.1903733-1-maskray@google.com

bbaf1ff0

lib/test_bpf: Call page_address() on page acquired with GFP_KERNEL flag · da1a055d

Sumitra Sharma authored Jun 23, 2023

generate_test_data() acquires a page with alloc_page(GFP_KERNEL).
The GFP_KERNEL is typical for kernel-internal allocations. The
caller requires ZONE_NORMAL or a lower zone for direct access.

Therefore the page cannot come from ZONE_HIGHMEM. Thus there's no
need to map it with kmap().

Also, the kmap() is being deprecated in favor of kmap_local_page() [1].

Hence, use a plain page_address() directly.

Since the page passed to the page_address() is not from the highmem
zone, the page_address() function will always return a valid kernel
virtual address and will not return NULL. Hence, remove the check
'if (!ptr)'.

Remove the unused variable 'ptr' and label 'err_free_page'.

  [1] https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com/

Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Sumitra Sharma <sumitraartsy@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/bpf/20230623151644.GA434468@sumitra.com

da1a055d

28 Jun, 2023 4 commits

Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next · 3a8a670e

Linus Torvalds authored Jun 28, 2023

Pull networking changes from Jakub Kicinski:
 "WiFi 7 and sendpage changes are the biggest pieces of work for this
  release. The latter will definitely require fixes but I think that we
  got it to a reasonable point.

  Core:

   - Rework the sendpage & splice implementations

     Instead of feeding data into sockets page by page extend sendmsg
     handlers to support taking a reference on the data, controlled by a
     new flag called MSG_SPLICE_PAGES

     Rework the handling of unexpected-end-of-file to invoke an
     additional callback instead of trying to predict what the right
     combination of MORE/NOTLAST flags is

     Remove the MSG_SENDPAGE_NOTLAST flag completely

   - Implement SCM_PIDFD, a new type of CMSG type analogous to
     SCM_CREDENTIALS, but it contains pidfd instead of plain pid

   - Enable socket busy polling with CONFIG_RT

   - Improve reliability and efficiency of reporting for ref_tracker

   - Auto-generate a user space C library for various Netlink families

  Protocols:

   - Allow TCP to shrink the advertised window when necessary, prevent
     sk_rcvbuf auto-tuning from growing the window all the way up to
     tcp_rmem[2]

   - Use per-VMA locking for "page-flipping" TCP receive zerocopy

   - Prepare TCP for device-to-device data transfers, by making sure
     that payloads are always attached to skbs as page frags

   - Make the backoff time for the first N TCP SYN retransmissions
     linear. Exponential backoff is unnecessarily conservative

   - Create a new MPTCP getsockopt to retrieve all info
     (MPTCP_FULL_INFO)

   - Avoid waking up applications using TLS sockets until we have a full
     record

   - Allow using kernel memory for protocol ioctl callbacks, paving the
     way to issuing ioctls over io_uring

   - Add nolocalbypass option to VxLAN, forcing packets to be fully
     encapsulated even if they are destined for a local IP address

   - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
     in-kernel ECMP implementation (e.g. Open vSwitch) select the same
     link for all packets. Support L4 symmetric hashing in Open vSwitch

   - PPPoE: make number of hash bits configurable

   - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
     (ipconfig)

   - Add layer 2 miss indication and filtering, allowing higher layers
     (e.g. ACL filters) to make forwarding decisions based on whether
     packet matched forwarding state in lower devices (bridge)

   - Support matching on Connectivity Fault Management (CFM) packets

   - Hide the "link becomes ready" IPv6 messages by demoting their
     printk level to debug

   - HSR: don't enable promiscuous mode if device offloads the proto

   - Support active scanning in IEEE 802.15.4

   - Continue work on Multi-Link Operation for WiFi 7

  BPF:

   - Add precision propagation for subprogs and callbacks. This allows
     maintaining verification efficiency when subprograms are used, or
     in fact passing the verifier at all for complex programs,
     especially those using open-coded iterators

   - Improve BPF's {g,s}etsockopt() length handling. Previously BPF
     assumed the length is always equal to the amount of written data.
     But some protos allow passing a NULL buffer to discover what the
     output buffer *should* be, without writing anything

   - Accept dynptr memory as memory arguments passed to helpers

   - Add routing table ID to bpf_fib_lookup BPF helper

   - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands

   - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
     maps as read-only)

   - Show target_{obj,btf}_id in tracing link fdinfo

   - Addition of several new kfuncs (most of the names are
     self-explanatory):
      - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
        bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
        and bpf_dynptr_clone().
      - bpf_task_under_cgroup()
      - bpf_sock_destroy() - force closing sockets
      - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs

  Netfilter:

   - Relax set/map validation checks in nf_tables. Allow checking
     presence of an entry in a map without using the value

   - Increase ip_vs_conn_tab_bits range for 64BIT builds

   - Allow updating size of a set

   - Improve NAT tuple selection when connection is closing

  Driver API:

   - Integrate netdev with LED subsystem, to allow configuring HW
     "offloaded" blinking of LEDs based on link state and activity
     (i.e. packets coming in and out)

   - Support configuring rate selection pins of SFP modules

   - Factor Clause 73 auto-negotiation code out of the drivers, provide
     common helper routines

   - Add more fool-proof helpers for managing lifetime of MDIO devices
     associated with the PCS layer

   - Allow drivers to report advanced statistics related to Time Aware
     scheduler offload (taprio)

   - Allow opting out of VF statistics in link dump, to allow more VFs
     to fit into the message

   - Split devlink instance and devlink port operations

  New hardware / drivers:

   - Ethernet:
      - Synopsys EMAC4 IP support (stmmac)
      - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
      - Marvell 88E6250 7 port switches
      - Microchip LAN8650/1 Rev.B0 PHYs
      - MediaTek MT7981/MT7988 built-in 1GE PHY driver

   - WiFi:
      - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
      - Realtek RTL8723DS (SDIO variant)
      - Realtek RTL8851BE

   - CAN:
      - Fintek F81604

  Drivers:

   - Ethernet NICs:
      - Intel (100G, ice):
         - support dynamic interrupt allocation
         - use meta data match instead of VF MAC addr on slow-path
      - nVidia/Mellanox:
         - extend link aggregation to handle 4, rather than just 2 ports
         - spawn sub-functions without any features by default
      - OcteonTX2:
         - support HTB (Tx scheduling/QoS) offload
         - make RSS hash generation configurable
         - support selecting Rx queue using TC filters
      - Wangxun (ngbe/txgbe):
         - add basic Tx/Rx packet offloads
         - add phylink support (SFP/PCS control)
      - Freescale/NXP (enetc):
         - report TAPRIO packet statistics
      - Solarflare/AMD:
         - support matching on IP ToS and UDP source port of outer
           header
         - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
         - add devlink dev info support for EF10

   - Virtual NICs:
      - Microsoft vNIC:
         - size the Rx indirection table based on requested
           configuration
         - support VLAN tagging
      - Amazon vNIC:
         - try to reuse Rx buffers if not fully consumed, useful for ARM
           servers running with 16kB pages
      - Google vNIC:
         - support TCP segmentation of >64kB frames

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - enable USXGMII (88E6191X)
      - Microchip:
         - lan966x: add support for Egress Stage 0 ACL engine
         - lan966x: support mapping packet priority to internal switch
           priority (based on PCP or DSCP)

   - Ethernet PHYs:
      - Broadcom PHYs:
         - support for Wake-on-LAN for BCM54210E/B50212E
         - report LPI counter
      - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
      - Micrel PHYs: receive timestamp in the frame (LAN8841)
      - Realtek PHYs: support optional external PHY clock
      - Altera TSE PCS: merge the driver into Lynx PCS which it is a
        variant of

   - CAN: Kvaser PCIEcan:
      - support packet timestamping

   - WiFi:
      - Intel (iwlwifi):
         - major update for new firmware and Multi-Link Operation (MLO)
         - configuration rework to drop test devices and split the
           different families
         - support for segmented PNVM images and power tables
         - new vendor entries for PPAG (platform antenna gain) feature
      - Qualcomm 802.11ax (ath11k):
         - Multiple Basic Service Set Identifier (MBSSID) and Enhanced
           MBSSID Advertisement (EMA) support in AP mode
         - support factory test mode
      - RealTek (rtw89):
         - add RSSI based antenna diversity
         - support U-NII-4 channels on 5 GHz band
      - RealTek (rtl8xxxu):
         - AP mode support for 8188f
         - support USB RX aggregation for the newer chips"

* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
  net: scm: introduce and use scm_recv_unix helper
  af_unix: Skip SCM_PIDFD if scm->pid is NULL.
  net: lan743x: Simplify comparison
  netlink: Add __sock_i_ino() for __netlink_diag_dump().
  net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
  Revert "af_unix: Call scm_recv() only after scm_set_cred()."
  phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
  libceph: Partially revert changes to support MSG_SPLICE_PAGES
  net: phy: mscc: fix packet loss due to RGMII delays
  net: mana: use vmalloc_array and vcalloc
  net: enetc: use vmalloc_array and vcalloc
  ionic: use vmalloc_array and vcalloc
  pds_core: use vmalloc_array and vcalloc
  gve: use vmalloc_array and vcalloc
  octeon_ep: use vmalloc_array and vcalloc
  net: usb: qmi_wwan: add u-blox 0x1312 composition
  perf trace: fix MSG_SPLICE_PAGES build error
  ipvlan: Fix return value of ipvlan_queue_xmit()
  netfilter: nf_tables: fix underflow in chain reference counter
  netfilter: nf_tables: unbind non-anonymous set if rule construction fails
  ...

3a8a670e

Merge tag 'v6.5-rc1-sysctl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux · 6a8cbd92

Linus Torvalds authored Jun 28, 2023

Pull sysctl updates from Luis Chamberlain:
 "The changes for sysctl are in line with prior efforts to stop usage of
  deprecated routines which incur recursion and also make it hard to
  remove the empty array element in each sysctl array declaration.

  The most difficult user to modify was parport which required a bit of
  re-thinking of how to declare shared sysctls there, Joel Granados has
  stepped up to the plate to do most of this work and eventual removal
  of register_sysctl_table(). That work ended up saving us about 1465
  bytes according to bloat-o-meter. Since we gained a few bloat-o-meter
  karma points I moved two rather small sysctl arrays from
  kernel/sysctl.c leaving us only two more sysctl arrays to move left.

  Most changes have been tested on linux-next for about a month. The
  last straggler patches are a minor parport fix, changes to the sysctl
  kernel selftest so to verify correctness and prevent regressions for
  the future change he made to provide an alternative solution for the
  special sysctl mount point target which was using the now deprecated
  sysctl child element.

  This is all prep work to now finally be able to remove the empty array
  element in all sysctl declarations / registrations which is expected
  to save us a bit of bytes all over the kernel. That work will be
  tested early after v6.5-rc1 is out"

* tag 'v6.5-rc1-sysctl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  sysctl: replace child with an enumeration
  sysctl: Remove debugging dump_stack
  test_sysclt: Test for registering a mount point
  test_sysctl: Add an option to prevent test skip
  test_sysctl: Add an unregister sysctl test
  test_sysctl: Group node sysctl test under one func
  test_sysctl: Fix test metadata getters
  parport: plug a sysctl register leak
  sysctl: move security keys sysctl registration to its own file
  sysctl: move umh sysctl registration to its own file
  signal: move show_unhandled_signals sysctl to its own file
  sysctl: remove empty dev table
  sysctl: Remove register_sysctl_table
  sysctl: Refactor base paths registrations
  sysctl: stop exporting register_sysctl_table
  parport: Removed sysctl related defines
  parport: Remove register_sysctl_table from parport_default_proc_register
  parport: Remove register_sysctl_table from parport_device_proc_register
  parport: Remove register_sysctl_table from parport_proc_register
  parport: Move magic number "15" to a define

6a8cbd92

Merge tag 'v6.5-rc1-modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux · 4e3c09e9

Linus Torvalds authored Jun 28, 2023

Pull module updates from Luis Chamberlain:
 "The changes queued up for modules are pretty tame, mostly code removal
  or moving of code.

  Only two minor functional changes are made, the only one which stands
  out is Sebastian Andrzej Siewior's simplification of module reference
  counting by removing preempt_disable() and that has been tested on
  linux-next for well over a month without any regressions.

  I'm now, I guess, also a kitchen sink for some kallsyms changes"

[ There was a mis-communication about the concurrent module load changes
  that I had expected to come through Luis despite me authoring the
  patch. So some of the module updates were left hanging in the email
  ether, and I just committed them separately.

  It's my bad - I should have made it more clear that I expected my
  own patches to come through the module tree too. Now they missed
  linux-next, but hopefully that won't cause any issues    - Linus ]

* tag 'v6.5-rc1-modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  kallsyms: make kallsyms_show_value() as generic function
  kallsyms: move kallsyms_show_value() out of kallsyms.c
  kallsyms: remove unsed API lookup_symbol_attrs
  kallsyms: remove unused arch_get_kallsym() helper
  module: Remove preempt_disable() from module reference counting.

4e3c09e9

modules: catch concurrent module loads, treat them as idempotent · 9b9879fc

Linus Torvalds authored May 29, 2023

This is the new-and-improved attempt at avoiding huge memory load spikes
when the user space boot sequence tries to load hundreds (or even
thousands) of redundant duplicate modules in parallel.

See commit 9828ed3f ("module: error out early on concurrent load of
the same module file") for background and an earlier failed attempt that
was reverted.

That earlier attempt just said "concurrently loading the same module is
silly, just open the module file exclusively and return -ETXTBSY if
somebody else is already loading it".

While it is true that concurrent module loads of the same module are
silly, the reason that earlier attempt then failed was that the
concurrently loaded module would often be a prerequisite for another
module.

Thus failing to load the prerequisite would then cause cascading
failures of the other modules, rather than just short-circuiting that
one unnecessary module load.

At the same time, we still really don't want to load the contents of the
same module file hundreds of times, only to then wait for an eventually
successful load, and have everybody else return -EEXIST.

As a result, this takes another approach, and treats concurrent module
loads from the same file as "idempotent" in the inode.  So if one module
load is ongoing, we don't start a new one, but instead just wait for the
first one to complete and return the same return value as it did.

So unlike the first attempt, this does not return early: the intent is
not to speed up the boot, but to avoid a thundering herd problem in
allocating memory (both physical and virtual) for a module more than
once.

Also note that this does change behavior: it used to be that when you
had concurrent loads, you'd have one "winner" that would return success,
and everybody else would return -EEXIST.

In contrast, this idempotent logic goes all Oprah on the problem, and
says "You are a winner! And you are a winner! We are ALL winners".  But
since there's no possible actual real semantic difference between "you
loaded the module" and "somebody else already loaded the module", this
is more of a feel-good change than an actual honest-to-goodness semantic
change.

Of course, any true Johnny-come-latelies that don't get caught in the
concurrency filter will still return -EEXIST.  It's no different from
not even getting a seat at an Oprah taping.  That's life.
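
The "wait for the first load and share its result" idea can be modelled in
user space. The Python below is a sketch under stated assumptions, not the
kernel implementation: IdempotentLoader, _Pending, waiters and demo are
hypothetical names, a string key stands in for the inode, and fake_load
stands in for the actual module load.

```python
import threading
import time


class _Pending:
    """State shared by everyone waiting on one in-flight load."""

    def __init__(self):
        self.done = threading.Event()
        self.waiters = 0
        self.result = None


class IdempotentLoader:
    def __init__(self, do_load):
        self._do_load = do_load
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> _Pending

    def waiters(self, key):
        with self._lock:
            pending = self._in_flight.get(key)
            return pending.waiters if pending else 0

    def load(self, key):
        with self._lock:
            pending = self._in_flight.get(key)
            owner = pending is None
            if owner:
                pending = self._in_flight[key] = _Pending()
            else:
                pending.waiters += 1
        if not owner:
            pending.done.wait()       # do not start a second load
            return pending.result     # same return value as the first load
        try:
            pending.result = self._do_load(key)
        finally:
            with self._lock:
                del self._in_flight[key]  # later callers start a fresh load
            pending.done.set()
        return pending.result


def demo(n=8):
    calls = []
    release = threading.Event()

    def fake_load(key):
        calls.append(key)
        release.wait()  # keep the load in flight until everyone is queued
        return 0

    loader = IdempotentLoader(fake_load)
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(loader.load("inode-42")))
        for _ in range(n)
    ]
    for t in threads:
        t.start()
    while loader.waiters("inode-42") < n - 1:
        time.sleep(0.001)
    release.set()
    for t in threads:
        t.join()
    return calls, results


if __name__ == "__main__":
    calls, results = demo()
    print("loads performed:", len(calls), "results:", results)
```

In this model a caller that arrives after the entry has been removed starts a
new load of its own, which corresponds to the latecomer case that still gets
-EEXIST in the kernel.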

See the long thread on the kernel mailing list about this all, which
includes some numbers for memory use before and after the patch.

Link: https://lore.kernel.org/lkml/20230524213620.3509138-1-mcgrof@kernel.org/
Reviewed-by: Johan Hovold <johan@kernel.org>
Tested-by: Johan Hovold <johan@kernel.org>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Rudi Heitbaum <rudi@heitbaum.com>
Tested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9b9879fc