- 26 Feb, 2021 23 commits
-
-
Cong Wang authored
It is now nearly identical to bpf_prog_run_pin_on_cpu() and it has an unused parameter 'psock', so we can just get rid of it and call bpf_prog_run_pin_on_cpu() directly. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-9-xiyou.wangcong@gmail.com
-
Cong Wang authored
It is only used within skmsg.c so can become static. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-8-xiyou.wangcong@gmail.com
-
Cong Wang authored
It is only used within sock_map.c so can become static. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-7-xiyou.wangcong@gmail.com
-
Cong Wang authored
These two eBPF programs are tied to BPF_SK_SKB_STREAM_PARSER and BPF_SK_SKB_STREAM_VERDICT, rename them to reflect the fact they are only used for TCP. And save the name 'skb_verdict' for general use later. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-6-xiyou.wangcong@gmail.com
-
Cong Wang authored
Currently TCP_SKB_CB() is hard-coded in skmsg code, it certainly does not work for any other non-TCP protocols. We can move them to skb ext, but it introduces a memory allocation on fast path. Fortunately, we only need to a word-size to store all the information, because the flags actually only contains 1 bit so can be just packed into the lowest bit of the "pointer", which is stored as unsigned long. Inside struct sk_buff, '_skb_refdst' can be reused because skb dst is no longer needed after ->sk_data_ready() so we can just drop it. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-5-xiyou.wangcong@gmail.com
-
Cong Wang authored
Currently, we compute ->data_end with a compile-time constant offset of skb. But as Jakub pointed out, we can actually compute it in eBPF JIT code at run-time, so that we can competely get rid of ->data_end. This is similar to skb_shinfo(skb) computation in bpf_convert_shinfo_access(). Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-4-xiyou.wangcong@gmail.com
-
Cong Wang authored
struct sk_psock_parser is embedded in sk_psock, it is unnecessary as skb verdict also uses ->saved_data_ready. We can simply fold these fields into sk_psock, and get rid of ->enabled. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-3-xiyou.wangcong@gmail.com
-
Cong Wang authored
As suggested by John, clean up sockmap related Kconfigs: Reduce the scope of CONFIG_BPF_STREAM_PARSER down to TCP stream parser, to reflect its name. Make the rest sockmap code simply depend on CONFIG_BPF_SYSCALL and CONFIG_INET, the latter is still needed at this point because of TCP/UDP proto update. And leave CONFIG_NET_SOCK_MSG untouched, as it is used by non-sockmap cases. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-2-xiyou.wangcong@gmail.com
-
Hangbin Liu authored
Commit 34b2021c ("bpf: Add BPF-helper for MTU checking") added an extra blank line in bpf helper description. This will make bpf_helpers_doc.py stop building bpf_helper_defs.h immediately after bpf_check_mtu(), which will affect future added functions. Fixes: 34b2021c ("bpf: Add BPF-helper for MTU checking") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20210223131457.1378978-1-liuhangbin@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Ciara Loftus says: ==================== This series attempts to improve the xsk selftest framework by: 1. making the default output less verbose 2. adding an optional verbose flag to both the test_xsk.sh script and xdpxceiver app. 3. renaming the debug option in the app to to 'dump-pkts' and add a flag to the test_xsk.sh script which enables the flag in the app. 4. changing how tests are launched - now they are launched from the xdpxceiver app instead of the script. Once the improvements are made, a new set of tests are added which test the xsk statistics. The output of the test script now looks like: ./test_xsk.sh PREREQUISITES: [ PASS ] 1..10 ok 1 PASS: SKB NOPOLL ok 2 PASS: SKB POLL ok 3 PASS: SKB NOPOLL Socket Teardown ok 4 PASS: SKB NOPOLL Bi-directional Sockets ok 5 PASS: SKB NOPOLL Stats ok 6 PASS: DRV NOPOLL ok 7 PASS: DRV POLL ok 8 PASS: DRV NOPOLL Socket Teardown ok 9 PASS: DRV NOPOLL Bi-directional Sockets ok 10 PASS: DRV NOPOLL Stats # Totals: pass:10 fail:0 xfail:0 xpass:0 skip:0 error:0 XSK KSELFTESTS: [ PASS ] v2->v3: * Rename dump-pkts to dump_pkts in test_xsk.sh * Add examples of flag usage to test_xsk.sh v1->v2: * Changed '-d' flag in the shell script to '-D' to be consistent with the xdpxceiver app. * Renamed debug mode to 'dump-pkts' which better reflects the behaviour. * Use libpf APIs instead of calls to ss for configuring xdp on the links * Remove mutex init & destroy for each stats test * Added a description for each of the new statistics tests * Distinguish between exiting due to initialisation failure vs test failure This series applies on commit d310ec03 Ciara Loftus (3): selftests/bpf: expose and rename debug argument selftests/bpf: restructure xsk selftests selftests/bpf: introduce xsk statistics tests ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Ciara Loftus authored
This commit introduces a range of tests to the xsk testsuite for validating xsk statistics. A new test type called 'stats' is added. Within it there are four sub-tests. Each test configures a scenario which should trigger the given error statistic. The test passes if the statistic is successfully incremented. The four statistics for which tests have been created are: 1. rx dropped Increase the UMEM frame headroom to a value which results in insufficient space in the rx buffer for both the packet and the headroom. 2. tx invalid Set the 'len' field of tx descriptors to an invalid value (umem frame size + 1). 3. rx ring full Reduce the size of the RX ring to a fraction of the fill ring size. 4. fill queue empty Do not populate the fill queue and then try to receive pkts. Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20210223162304.7450-5-ciara.loftus@intel.com
-
Ciara Loftus authored
Prior to this commit individual xsk tests were launched from the shell script 'test_xsk.sh'. When adding a new test type, two new test configurations had to be added to this file - one for each of the supported XDP 'modes' (skb or drv). Should zero copy support be added to the xsk selftest framework in the future, three new test configurations would need to be added for each new test type. Each new test type also typically requires new CLI arguments for the xdpxceiver program. This commit aims to reduce the overhead of adding new tests, by launching the test configurations from within the xdpxceiver program itself, using simple loops. Every test is run every time the C program is executed. Many of the CLI arguments can be removed as a result. Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20210223162304.7450-4-ciara.loftus@intel.com
-
Ciara Loftus authored
Launching xdpxceiver with -D enables what was formerly know as 'debug' mode. Rename this mode to 'dump-pkts' as it better describes the behavior enabled by the option. New usage: ./xdpxceiver .. -D or ./xdpxceiver .. --dump-pkts Also make it possible to pass this flag to the app via the test_xsk.sh shell script like so: ./test_xsk.sh -D Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210223162304.7450-3-ciara.loftus@intel.com
-
Magnus Karlsson authored
Make the xsk tests less verbose by only printing the essentials. Currently, it is hard to see if the tests passed or not due to all the printouts. Move the extra printouts to a verbose option, if further debugging is needed when a problem arises. To run the xsk tests with verbose output: ./test_xsk.sh -v Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20210223162304.7450-2-ciara.loftus@intel.com
-
Brendan Jackman authored
This function has become overloaded, it actually does lots of diverse things in a single pass. Rename it to avoid confusion, and add some concise commentary. Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210217104509.2423183-1-jackmanb@google.com
-
Dmitrii Banshchikov authored
Instead of using integer literal here and there use macro name for better context. Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210225202629.585485-1-me@ubique.spb.ru
-
Alexei Starovoitov authored
Song Liu says: ==================== This set enables task local storage for non-BPF_LSM programs. It is common for tracing BPF program to access per-task data. Currently, these data are stored in hash tables with pid as the key. In bcc/libbpftools [1], 9 out of 23 tools use such hash tables. However, hash table is not ideal for many use case. Task local storage provides better usability and performance for BPF programs. Please refer to 6/6 for some performance comparison of task local storage vs. hash table. Changes v5 => v6: 1. Add inc/dec bpf_task_storage_busy in bpf_local_storage_map_free(). Changes v4 => v5: 1. Fix build w/o CONFIG_NET. (kernel test robot) 2. Remove unnecessary check for !task_storage_ptr(). (Martin) 3. Small changes in commit logs. Changes v3 => v4: 1. Prevent deadlock from recursive calls of bpf_task_storage_[get|delete]. (2/6 checks potential deadlock and fails over, 4/6 adds a selftest). Changes v2 => v3: 1. Make the selftest more robust. (Andrii) 2. Small changes with runqslower. (Andrii) 3. Shortern CC list to make it easy for vger. Changes v1 => v2: 1. Do not allocate task local storage when the task is being freed. 2. Revise the selftest and added a new test for a task being freed. 3. Minor changes in runqslower. ==================== Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Song Liu authored
Replace hashtab with task local storage in runqslower. This improves the performance of these BPF programs. The following table summarizes average runtime of these programs, in nanoseconds: task-local hash-prealloc hash-no-prealloc handle__sched_wakeup 125 340 3124 handle__sched_wakeup_new 2812 1510 2998 handle__sched_switch 151 208 991 Note that, task local storage gives better performance than hashtab for handle__sched_wakeup and handle__sched_switch. On the other hand, for handle__sched_wakeup_new, task local storage is slower than hashtab with prealloc. This is because handle__sched_wakeup_new accesses the data for the first time, so it has to allocate the data for task local storage. Once the initial allocation is done, subsequent accesses, as those in handle__sched_wakeup, are much faster with task local storage. If we disable hashtab prealloc, task local storage is much faster for all 3 functions. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210225234319.336131-7-songliubraving@fb.com
-
Song Liu authored
Update the Makefile to prefer using $(O)/vmlinux, $(KBUILD_OUTPUT)/vmlinux (for selftests) or ../../../vmlinux. These two files should have latest definitions for vmlinux.h. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210225234319.336131-6-songliubraving@fb.com
-
Song Liu authored
Add a test with recursive bpf_task_storage_[get|delete] from fentry programs on bpf_local_storage_lookup and bpf_local_storage_update. Without proper deadlock prevent mechanism, this test would cause deadlock. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210225234319.336131-5-songliubraving@fb.com
-
Song Liu authored
Task local storage is enabled for tracing programs. Add two tests for task local storage without CONFIG_BPF_LSM. The first test stores a value in sys_enter and read it back in sys_exit. The second test checks whether the kernel allows allocating task local storage in exit_creds() (which it should not). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210225234319.336131-4-songliubraving@fb.com
-
Song Liu authored
BPF helpers bpf_task_storage_[get|delete] could hold two locks: bpf_local_storage_map_bucket->lock and bpf_local_storage->lock. Calling these helpers from fentry/fexit programs on functions in bpf_*_storage.c may cause deadlock on either locks. Prevent such deadlock with a per cpu counter, bpf_task_storage_busy. We need this counter to be global, because the two locks here belong to two different objects: bpf_local_storage_map and bpf_local_storage. If we pick one of them as the owner of the counter, it is still possible to trigger deadlock on the other lock. For example, if bpf_local_storage_map owns the counters, it cannot prevent deadlock on bpf_local_storage->lock when two maps are used. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210225234319.336131-3-songliubraving@fb.com
-
Song Liu authored
To access per-task data, BPF programs usually creates a hash table with pid as the key. This is not ideal because: 1. The user need to estimate the proper size of the hash table, which may be inaccurate; 2. Big hash tables are slow; 3. To clean up the data properly during task terminations, the user need to write extra logic. Task local storage overcomes these issues and offers a better option for these per-task data. Task local storage is only available to BPF_LSM. Now enable it for tracing programs. Unlike LSM programs, tracing programs can be called in IRQ contexts. Helpers that access task local storage are updated to use raw_spin_lock_irqsave() instead of raw_spin_lock_bh(). Tracing programs can attach to functions on the task free path, e.g. exit_creds(). To avoid allocating task local storage after bpf_task_storage_free(). bpf_task_storage_get() is updated to not allocate new storage when the task is not refcounted (task->usage == 0). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: KP Singh <kpsingh@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210225234319.336131-2-songliubraving@fb.com
-
- 24 Feb, 2021 6 commits
-
-
Xuan Zhuo authored
This patch is used to construct skb based on page to save memory copy overhead. This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to directly construct skb. If this feature is not supported, it is still necessary to copy data to construct skb. ---------------- Performance Testing ------------ The test environment is Aliyun ECS server. Test cmd: ``` xdpsock -i eth0 -t -S -s <msg size> ``` Test result data: size 64 512 1024 1500 copy 1916747 1775988 1600203 1440054 page 1974058 1953655 1945463 1904478 percent 3.0% 10.0% 21.58% 32.3% Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210218204908.5455-6-alobakin@pm.me
-
Alexander Lobakin authored
xsk_generic_xmit() allocates a new skb and then queues it for xmitting. The size of new skb's headroom is desc->len, so it comes to the driver/device with no reserved headroom and/or tailroom. Lots of drivers need some headroom (and sometimes tailroom) to prepend (and/or append) some headers or data, e.g. CPU tags, device-specific headers/descriptors (LSO, TLS etc.), and if case of no available space skb_cow_head() will reallocate the skb. Reallocations are unwanted on fast-path, especially when it comes to XDP, so generic XSK xmit should reserve the spaces declared in dev->needed_headroom and dev->needed tailroom to avoid them. Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)): Usually, output functions reserve LL_RESERVED_SPACE(dev), which consists of dev->hard_header_len + dev->needed_headroom, aligned by 16. However, on XSK xmit hard header is already here in the chunk, so hard_header_len is not needed. But it'd still be better to align data up to cacheline, while reserving no less than driver requests for headroom. NET_SKB_PAD here is to double-insure there will be no reallocations even when the driver advertises no needed_headroom, but in fact need it (not so rare case). Fixes: 35fcde7f ("xsk: support for Tx") Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210218204908.5455-5-alobakin@pm.me
-
Xuan Zhuo authored
Virtio net supports the case where the skb linear space is empty, so add priv_flags. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210218204908.5455-4-alobakin@pm.me
-
Xuan Zhuo authored
In some cases, we hope to construct skb directly based on the existing memory without copying data. In this case, the page will be placed directly in the skb, and the linear space of skb is empty. But unfortunately, many the network card does not support this operation. For example Mellanox Technologies MT27710 Family [ConnectX-4 Lx] will get the following error message: mlx5_core 0000:3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 00000000: 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00000020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 mlx5_core 0000:3b:00.1 eth1: ERR CQE on SQ: 0x1dbb So a priv_flag is added here to indicate whether the network card supports this feature. Suggested-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210218204908.5455-3-alobakin@pm.me
-
Alexander Lobakin authored
This is harmless for now, but can be fatal for future refactors. Fixes: 871b642a ("netdev: introduce ndo_set_rx_headroom") Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210218204908.5455-2-alobakin@pm.me
-
Grant Seltzer authored
This adds both the CONFIG_DEBUG_INFO_BTF and CONFIG_DEBUG_INFO_BTF_MODULES kernel compile option to output of the bpftool feature command. This is relevant for developers that want to account for data structure definition differences between kernels. Signed-off-by: Grant Seltzer <grantseltzer@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210222195846.155483-1-grantseltzer@gmail.com
-
- 21 Feb, 2021 11 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull performance event updates from Ingo Molnar: - Add CPU-PMU support for Intel Sapphire Rapids CPUs - Extend the perf ABI with PERF_SAMPLE_WEIGHT_STRUCT, to offer two-parameter sampling event feedback. Not used yet, but is intended for Golden Cove CPU-PMU, which can provide both the instruction latency and the cache latency information for memory profiling events. - Remove experimental, default-disabled perfmon-v4 counter_freezing support that could only be enabled via a boot option. The hardware is hopelessly broken, we'd like to make sure nobody starts relying on this, as it would only end in tears. - Fix energy/power events on Intel SPR platforms - Simplify the uprobes resume_execution() logic - Misc smaller fixes. * tag 'perf-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/rapl: Fix psys-energy event on Intel SPR platform perf/x86/rapl: Only check lower 32bits for RAPL energy counters perf/x86/rapl: Add msr mask support perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[] perf/x86/intel: Support CPUID 10.ECX to disable fixed counters perf/x86/intel: Add perf core PMU support for Sapphire Rapids perf/x86/intel: Filter unsupported Topdown metrics event perf/x86/intel: Factor out intel_update_topdown_event() perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT perf/intel: Remove Perfmon-v4 counter_freezing support x86/perf: Use static_call for x86_pmu.guest_get_msrs perf/x86/intel/uncore: With > 8 nodes, get pci bus die id from NUMA info perf/x86/intel/uncore: Store the logical die id instead of the physical die id. x86/kprobes: Do not decode opcode in resume_execution()
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull scheduler updates from Ingo Molnar: "Core scheduler updates: - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the preempt=none/voluntary/full boot options (default: full), to allow distros to build a PREEMPT kernel but fall back to close to PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling behavior via a boot time selection. There's also the /debug/sched_debug switch to do this runtime. This feature is implemented via runtime patching (a new variant of static calls). The scope of the runtime patching can be best reviewed by looking at the sched_dynamic_update() function in kernel/sched/core.c. ( Note that the dynamic none/voluntary mode isn't 100% identical, for example preempt-RCU is available in all cases, plus the preempt count is maintained in all models, which has runtime overhead even with the code patching. ) The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast majority of distributions, are supposed to be unaffected. - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that was found via rcutorture triggering a hang. The bug is that rcu_idle_enter() may wake up a NOCB kthread, but this happens after the last generic need_resched() check. Some cpuidle drivers fix it by chance but many others don't. In true 2020 fashion the original bug fix has grown into a 5-patch scheduler/RCU fix series plus another 16 RCU patches to address the underlying issue of missed preemption events. These are the initial fixes that should fix current incarnations of the bug. - Clean up rbtree usage in the scheduler, by providing & using the following consistent set of rbtree APIs: partial-order; less() based: - rb_add(): add a new entry to the rbtree - rb_add_cached(): like rb_add(), but for a rb_root_cached total-order; cmp() based: - rb_find(): find an entry in an rbtree - rb_find_add(): find an entry, and add if not found - rb_find_first(): find the first (leftmost) matching entry - rb_next_match(): continue from rb_find_first() - rb_for_each(): iterate a sub-tree using the previous two - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a single pass. This is a 4-commit series where each commit improves one aspect of the idle sibling scan logic. - Improve the cpufreq cooling driver by getting the effective CPU utilization metrics from the scheduler - Improve the fair scheduler's active load-balancing logic by reducing the number of active LB attempts & lengthen the load-balancing interval. This improves stress-ng mmapfork performance. - Fix CFS's estimated utilization (util_est) calculation bug that can result in too high utilization values Misc updates & fixes: - Fix the HRTICK reprogramming & optimization feature - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code - Reduce dl_add_task_root_domain() overhead - Fix uprobes refcount bug - Process pending softirqs in flush_smp_call_function_from_idle() - Clean up task priority related defines, remove *USER_*PRIO and USER_PRIO() - Simplify the sched_init_numa() deduplication sort - Documentation updates - Fix EAS bug in update_misfit_status(), which degraded the quality of energy-balancing - Smaller cleanups" * tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) sched,x86: Allow !PREEMPT_DYNAMIC entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point entry: Explicitly flush pending rcuog wakeup before last rescheduling point rcu/nocb: Trigger self-IPI on late deferred wake up before user resume rcu/nocb: Perform deferred wake up before last idle's need_resched() check rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers sched/features: Distinguish between NORMAL and DEADLINE hrtick sched/features: Fix hrtick reprogramming sched/deadline: Reduce rq lock contention in dl_add_task_root_domain() uprobes: (Re)add missing get_uprobe() in __find_uprobe() smp: Process pending softirqs in flush_smp_call_function_from_idle() sched: Harden PREEMPT_DYNAMIC static_call: Allow module use without exposing static_call_key sched: Add /debug/sched_preempt preempt/dynamic: Support dynamic preempt with preempt= boot option preempt/dynamic: Provide irqentry_exit_cond_resched() static call preempt/dynamic: Provide preempt_schedule[_notrace]() static calls preempt/dynamic: Provide cond_resched() and might_resched() static calls preempt: Introduce CONFIG_PREEMPT_DYNAMIC static_call: Provide DEFINE_STATIC_CALL_RET0() ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull tlb gather updates from Ingo Molnar: "Theses fix MM (soft-)dirty bit management in the procfs code & clean up the TLB gather API" * tag 'core-mm-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ldt: Use tlb_gather_mmu_fullmm() when freeing LDT page-tables tlb: arch: Remove empty __tlb_remove_tlb_entry() stubs tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu() tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm() tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu() mm: proc: Invalidate TLB after clearing soft-dirty page state
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull locking updates from Ingo Molnar: "Core locking primitives updates: - Remove mutex_trylock_recursive() from the API - no users left - Simplify + constify the futex code a bit Lockdep updates: - Teach lockdep about local_lock_t - Add CONFIG_DEBUG_IRQFLAGS=y debug config option to check for potentially unsafe IRQ mask restoration patterns. (I.e. calling raw_local_irq_restore() with IRQs enabled.) - Add wait context self-tests - Fix graph lock corner case corrupting internal data structures - Fix noinstr annotations LKMM updates: - Simplify the litmus tests - Documentation fixes KCSAN updates: - Re-enable KCSAN instrumentation in lib/random32.c Misc fixes: - Don't branch-trace static label APIs - DocBook fix - Remove stale leftover empty file" * tag 'locking-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) checkpatch: Don't check for mutex_trylock_recursive() locking/mutex: Kill mutex_trylock_recursive() s390: Use arch_local_irq_{save,restore}() in early boot code lockdep: Noinstr annotate warn_bogus_irq_restore() locking/lockdep: Avoid unmatched unlock locking/rwsem: Remove empty rwsem.h locking/rtmutex: Add missing kernel-doc markup futex: Remove unneeded gotos futex: Change utime parameter to be 'const ... *' lockdep: report broken irq restoration jump_label: Do not profile branch annotations locking: Add Reviewers locking/selftests: Add local_lock inversion tests locking/lockdep: Exclude local_lock_t from IRQ inversions locking/lockdep: Clean up check_redundant() a bit locking/lockdep: Add a skip() function to __bfs() locking/lockdep: Mark local_lock_t locking/selftests: More granular debug_locks_verbose lockdep/selftest: Add wait context selftests tools/memory-model: Fix typo in klitmus7 compatibility table ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull RCU updates from Ingo Molnar: "These are the latest RCU updates for v5.12: - Documentation updates. - Miscellaneous fixes. - kfree_rcu() updates: Addition of mem_dump_obj() to provide allocator return addresses to more easily locate bugs. This has a couple of RCU-related commits, but is mostly MM. Was pulled in with akpm's agreement. - Per-callback-batch tracking of numbers of callbacks, which enables better debugging information and smarter reactions to large numbers of callbacks. - The first round of changes to allow CPUs to be runtime switched from and to callback-offloaded state. - CONFIG_PREEMPT_RT-related changes. - RCU CPU stall warning updates. - Addition of polling grace-period APIs for SRCU. - Torture-test and torture-test scripting updates, including a "torture everything" script that runs rcutorture, locktorture, scftorture, rcuscale, and refscale. Plus does an allmodconfig build. - nolibc fixes for the torture tests" * tag 'core-rcu-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits) percpu_ref: Dump mem_dump_obj() info upon reference-count underflow rcu: Make call_rcu() print mem_dump_obj() info for double-freed callback mm: Make mem_obj_dump() vmalloc() dumps include start and length mm: Make mem_dump_obj() handle vmalloc() memory mm: Make mem_dump_obj() handle NULL and zero-sized pointers mm: Add mem_dump_obj() to print source of memory block tools/rcutorture: Fix position of -lgcc in mkinitrd.sh tools/nolibc: Fix position of -lgcc in the documented example tools/nolibc: Emit detailed error for missing alternate syscall number definitions tools/nolibc: Remove incorrect definitions of __ARCH_WANT_* tools/nolibc: Get timeval, timespec and timezone from linux/time.h tools/nolibc: Implement poll() based on ppoll() tools/nolibc: Implement fork() based on clone() tools/nolibc: Make getpgrp() fall back to getpgid(0) tools/nolibc: Make dup2() rely on dup3() when available tools/nolibc: Add the definition for dup() rcutorture: Add rcutree.use_softirq=0 to RUDE01 and TASKS01 torture: Maintain torture-specific set of CPUs-online books torture: Clean up after torture-test CPU hotplugging rcutorture: Make object_debug also double call_rcu() heap object ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull timer updates from Thomas Gleixner: "Time and timer updates: - Instead of new drivers remove tango, sirf, u300 and atlas drivers - Add suspend/resume support for microchip pit64b - The usual fixes, improvements and cleanups here and there" * tag 'timers-core-2021-02-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timens: Delete no-op time_ns_init() alarmtimer: Update kerneldoc clocksource/drivers/timer-microchip-pit64b: Add clocksource suspend/resume clocksource/drivers/prima: Remove sirf prima driver clocksource/drivers/atlas: Remove sirf atlas driver clocksource/drivers/tango: Remove tango driver clocksource/drivers/u300: Remove the u300 driver dt-bindings: timer: nuvoton: Clarify that interrupt of timer 0 should be specified clocksource/drivers/davinci: Move pr_fmt() before the includes clocksource/drivers/efm32: Drop unused timer code
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull irq updates from Thomas Gleixner: "Updates for the irq subsystem: - The usual new irq chip driver (Realtek RTL83xx) - Removal of sirfsoc and tango irq chip drivers - Conversion of the sun6i chip support to hierarchical irq domains - The usual fixes, improvements and cleanups all over the place" * tag 'irq-core-2021-02-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/imx: IMX_INTMUX should not default to y, unconditionally irqchip/loongson-pch-msi: Use bitmap_zalloc() to allocate bitmap irqchip/csky-mpintc: Prevent selection on unsupported platforms irqchip: Add support for Realtek RTL838x/RTL839x interrupt controller dt-bindings: interrupt-controller: Add Realtek RTL838x/RTL839x support irqchip/ls-extirq: add IRQCHIP_SKIP_SET_WAKE to the irqchip flags genirq: Use new tasklet API for resend_tasklet dt-bindings: qcom,pdc: Add compatible for SM8350 dt-bindings: qcom,pdc: Add compatible for SM8250 irqchip/sun6i-r: Add wakeup support irqchip/sun6i-r: Use a stacked irqchip driver dt-bindings: irq: sun6i-r: Add a compatible for the H3 dt-bindings: irq: sun6i-r: Split the binding from sun7i-nmi irqchip/gic-v3: Fix typos in PMR/RPR SCR_EL3.FIQ handling explanation irqchip: Remove sirfsoc driver irqchip: Remove sigma tango driver
-
git://git.kernel.dk/linux-blockLinus Torvalds authored
Pull io_uring updates from Jens Axboe: "Highlights from this cycles are things like request recycling and task_work optimizations, which net us anywhere from 10-20% of speedups on workloads that mostly are inline. This work was originally done to put io_uring under memcg, which adds considerable overhead. But it's a really nice win as well. Also worth highlighting is the LOOKUP_CACHED work in the VFS, and using it in io_uring. Greatly speeds up the fast path for file opens. Summary: - Put io_uring under memcg protection. We accounted just the rings themselves under rlimit memlock before, now we account everything. - Request cache recycling, persistent across invocations (Pavel, me) - First part of a cleanup/improvement to buffer registration (Bijan) - SQPOLL fixes (Hao) - File registration NULL pointer fixup (Dan) - LOOKUP_CACHED support for io_uring - Disable /proc/thread-self/ for io_uring, like we do for /proc/self - Add Pavel to the io_uring MAINTAINERS entry - Tons of code cleanups and optimizations (Pavel) - Support for skip entries in file registration (Noah)" * tag 'for-5.12/io_uring-2021-02-17' of git://git.kernel.dk/linux-block: (103 commits) io_uring: tctx->task_lock should be IRQ safe proc: don't allow async path resolution of /proc/thread-self components io_uring: kill cached requests from exiting task closing the ring io_uring: add helper to free all request caches io_uring: allow task match to be passed to io_req_cache_free() io-wq: clear out worker ->fs and ->files io_uring: optimise io_init_req() flags setting io_uring: clean io_req_find_next() fast check io_uring: don't check PF_EXITING from syscall io_uring: don't split out consume out of SQE get io_uring: save ctx put/get for task_work submit io_uring: don't duplicate io_req_task_queue() io_uring: optimise SQPOLL mm/files grabbing io_uring: optimise out unlikely link queue io_uring: take compl state from submit state io_uring: inline io_complete_rw_common() io_uring: move res check out of io_rw_reissue() io_uring: simplify iopoll reissuing io_uring: clean up io_req_free_batch_finish() io_uring: move submit side state closer in the ring ...
-
git://git.kernel.dk/linux-blockLinus Torvalds authored
Pull block driver updates from Jens Axboe: - Remove the skd driver. It's been EOL for a long time (Damien) - NVMe pull requests - fix multipath handling of ->queue_rq errors (Chao Leng) - nvmet cleanups (Chaitanya Kulkarni) - add a quirk for buggy Amazon controller (Filippo Sironi) - avoid devm allocations in nvme-hwmon that don't interact well with fabrics (Hannes Reinecke) - sysfs cleanups (Jiapeng Chong) - fix nr_zones for multipath (Keith Busch) - nvme-tcp crash fix for no-data commands (Sagi Grimberg) - nvmet-tcp fixes (Sagi Grimberg) - add a missing __rcu annotation (Christoph) - failed reconnect fixes (Chao Leng) - various tracing improvements (Michal Krakowiak, Johannes Thumshirn) - switch the nvmet-fc assoc_list to use RCU protection (Leonid Ravich) - resync the status codes with the latest spec (Max Gurtovoy) - minor nvme-tcp improvements (Sagi Grimberg) - various cleanups (Rikard Falkeborn, Minwoo Im, Chaitanya Kulkarni, Israel Rukshin) - Floppy O_NDELAY fix (Denis) - MD pull request - raid5 chunk_sectors fix (Guoqing) - Use lore links (Kees) - Use DEFINE_SHOW_ATTRIBUTE for nbd (Liao) - loop lock scaling (Pavel) - mtip32xx PCI fixes (Bjorn) - bcache fixes (Kai, Dongdong) - Misc fixes (Tian, Yang, Guoqing, Joe, Andy) * tag 'for-5.12/drivers-2021-02-17' of git://git.kernel.dk/linux-block: (64 commits) lightnvm: pblk: Replace guid_copy() with export_guid()/import_guid() lightnvm: fix unnecessary NULL check warnings nvme-tcp: fix crash triggered with a dataless request submission block: Replace lkml.org links with lore nbd: Convert to DEFINE_SHOW_ATTRIBUTE nvme: add 48-bit DMA address quirk for Amazon NVMe controllers nvme-hwmon: rework to avoid devm allocation nvmet: remove else at the end of the function nvmet: add nvmet_req_subsys() helper nvmet: use min of device_path and disk len nvmet: use invalid cmd opcode helper nvmet: use invalid cmd opcode helper nvmet: add helper to report invalid opcode nvmet: remove extra variable in id-ns handler nvmet: make nvmet_find_namespace() req based nvmet: return uniform error for invalid ns nvmet: set status to 0 in case for invalid nsid nvmet-fc: add a missing __rcu annotation to nvmet_fc_tgt_assoc.queues nvme-multipath: set nr_zones for zoned namespaces nvmet-tcp: fix potential race of tcp socket closing accept_work ...
-
git://git.kernel.dk/linux-blockLinus Torvalds authored
Pull core block updates from Jens Axboe: "Another nice round of removing more code than what is added, mostly due to Christoph's relentless pursuit of tech debt removal/cleanups. This pull request contains: - Two series of BFQ improvements (Paolo, Jan, Jia) - Block iov_iter improvements (Pavel) - bsg error path fix (Pan) - blk-mq scheduler improvements (Jan) - -EBUSY discard fix (Jan) - bvec allocation improvements (Ming, Christoph) - bio allocation and init improvements (Christoph) - Store bdev pointer in bio instead of gendisk + partno (Christoph) - Block trace point cleanups (Christoph) - hard read-only vs read-only split (Christoph) - Block based swap cleanups (Christoph) - Zoned write granularity support (Damien) - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)" * tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits) mm: simplify swapdev_block sd_zbc: clear zone resources for non-zoned case block: introduce blk_queue_clear_zone_settings() zonefs: use zone write granularity as block size block: introduce zone_write_granularity limit block: use blk_queue_set_zoned in add_partition() nullb: use blk_queue_set_zoned() to setup zoned devices nvme: cleanup zone information initialization block: document zone_append_max_bytes attribute block: use bi_max_vecs to find the bvec pool md/raid10: remove dead code in reshape_request block: mark the bio as cloned in bio_iov_bvec_set block: set BIO_NO_PAGE_REF in bio_iov_bvec_set block: remove a layer of indentation in bio_iov_iter_get_pages block: turn the nr_iovecs argument to bio_alloc* into an unsigned short block: remove the 1 and 4 vec bvec_slabs entries block: streamline bvec_alloc block: factor out a bvec_alloc_gfp helper block: move struct biovec_slab to bio.c block: reuse BIO_INLINE_VECS for integrity bvecs ...
-
git://git.kernel.dk/linux-blockLinus Torvalds authored
Pull libata updates from Jens Axboe: "Regulartors management addition from Florian, and a trivial change to avoid comma separated statements from Joe" * tag 'for-5.12/libata-2021-02-17' of git://git.kernel.dk/linux-block: ata: Avoid comma separated statements ata: ahci_brcm: Add back regulators management
-