- 10 Aug, 2022 6 commits
-
-
Alexei Starovoitov authored
The verifier cannot perform sufficient validation of bpf_attr->test.ctx_in pointer, therefore bpf programs should not be allowed to call BPF_PROG_RUN command from within the program. To fix this issue split bpf_sys_bpf() bpf helper into normal kern_sys_bpf() kernel function that can only be used by the kernel light skeleton directly. Reported-by: YiFei Zhu <zhuyifei@google.com> Fixes: b1d18a75 ("bpf: Extend sys_bpf commands for bpf_syscall programs.") Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Xu Kuohai authored
The sparse tool complains as follows: arch/arm64/net/bpf_jit_comp.c:1684:16: warning: incorrect type in assignment (different base types) arch/arm64/net/bpf_jit_comp.c:1684:16: expected unsigned int [usertype] *branch arch/arm64/net/bpf_jit_comp.c:1684:16: got restricted __le32 [usertype] * arch/arm64/net/bpf_jit_comp.c:1700:52: error: subtraction of different types can't work (different base types) arch/arm64/net/bpf_jit_comp.c:1734:29: warning: incorrect type in assignment (different base types) arch/arm64/net/bpf_jit_comp.c:1734:29: expected unsigned int [usertype] * arch/arm64/net/bpf_jit_comp.c:1734:29: got restricted __le32 [usertype] * arch/arm64/net/bpf_jit_comp.c:1918:52: error: subtraction of different types can't work (different base types) This is because the variable branch in function invoke_bpf_prog and the variable branches in function prepare_trampoline are defined as type u32 *, which conflicts with ctx->image's type __le32 *, so sparse complains when assignment or arithmetic operation are performed on these two variables and ctx->image. Since arm64 instructions are always little-endian, change the type of these two variables to __le32 * and call cpu_to_le32() to convert instruction to little-endian before writing it to memory. This is also in line with emit() which internally does cpu_to_le32(), too. Fixes: efc9909f ("bpf, arm64: Add bpf trampoline for arm64") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Link: https://lore.kernel.org/bpf/20220808040735.1232002-1-xukuohai@huawei.com
-
Alexei Starovoitov authored
Kumar Kartikeya Dwivedi says: ==================== Fix for a bug in prealloc_lru_pop spotted while reading the code, then a test + example that checks whether it is fixed. Changelog: ---------- v2 -> v3: v2: https://lore.kernel.org/bpf/20220809140615.21231-1-memxor@gmail.com * Switch test to use kptr instead of kptr_ref to stabilize test runs * Fix missing lru_bug__destroy (Yonghong) * Collect Acks v1 -> v2: v1: https://lore.kernel.org/bpf/20220806014603.1771-1-memxor@gmail.com * Expand commit log to include summary of the discussion with Yonghong * Make lru_bug selftest serial to not mess up refcount for map_kptr test ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Kumar Kartikeya Dwivedi authored
Add a regression test to check against invalid check_and_init_map_value call inside prealloc_lru_pop. The kptr should not be reset to NULL once we set it after deleting the map element. Hence, we trigger a program that updates the element causing its reuse, and checks whether the unref kptr is reset or not. If it is, prealloc_lru_pop does an incorrect check_and_init_map_value call and the test fails. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220809213033.24147-4-memxor@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Kumar Kartikeya Dwivedi authored
The LRU map that is preallocated may have its elements reused while another program holds a pointer to it from bpf_map_lookup_elem. Hence, only check_and_free_fields is appropriate when the element is being deleted, as it ensures proper synchronization against concurrent access of the map value. After that, we cannot call check_and_init_map_value again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while they can be concurrently accessed from a BPF program. This is safe to do as when the map entry is deleted, concurrent access is protected against by check_and_free_fields, i.e. an existing timer would be freed, and any existing kptr will be released by it. The program can create further timers and kptrs after check_and_free_fields, but they will eventually be released once the preallocated items are freed on map destruction, even if the item is never reused again. Hence, the deleted item sitting in the free list can still have resources attached to it, and they would never leak. With spin_lock, we never touch the field at all on delete or update, as we may end up modifying the state of the lock. Since the verifier ensures that a bpf_spin_lock call is always paired with bpf_spin_unlock call, the program will eventually release the lock so that on reuse the new user of the value can take the lock. Essentially, for the preallocated case, we must assume that the map value may always be in use by the program, even when it is sitting in the freelist, and handle things accordingly, i.e. use proper synchronization inside check_and_free_fields, and never reinitialize the special fields when it is reused on update. Fixes: 68134668 ("bpf: Add map side support for bpf timers.") Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220809213033.24147-3-memxor@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Kumar Kartikeya Dwivedi authored
In addition to TC hook, enable these in tracing programs so that they can be used in selftests. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220809213033.24147-2-memxor@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 08 Aug, 2022 4 commits
-
-
Aijun Sun authored
It is not necessary to allocate contiguous physical memory for BPF program buffer using kcalloc. When the BPF program is large more than memory page size, kcalloc allocates multiple memory pages from buddy system. If the device can not provide sufficient memory, for example in low-end android devices [0], memory allocation for BPF program is likely to fail. Test cases in lib/test_bpf.c all pass on ARM64 QEMU. [0] AndroidTestSuit: page allocation failure: order:4, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=foreground,mems_allowed=0 Call trace: dump_stack+0xa4/0x114 warn_alloc+0xf8/0x14c __alloc_pages_slowpath+0xac8/0xb14 __alloc_pages_nodemask+0x194/0x3d0 kmalloc_order_trace+0x44/0x1e8 __kmalloc+0x29c/0x66c bpf_int_jit_compile+0x17c/0x568 bpf_prog_select_runtime+0x4c/0x1b0 bpf_prepare_filter+0x5fc/0x6bc bpf_prog_create_from_user+0x118/0x1c0 seccomp_set_mode_filter+0x1c4/0x7cc __do_sys_prctl+0x380/0x1424 __arm64_sys_prctl+0x20/0x2c el0_svc_common+0xc8/0x22c el0_svc_handler+0x1c/0x28 el0_svc+0x8/0x100 Signed-off-by: Aijun Sun <aijun.sun@unisoc.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220804025442.22524-1-aijun.sun@unisoc.com
-
Stanislav Fomichev authored
Apparently, no existing selftest covers it. Add a new one where we load cgroup/bind4 program and attach fentry to it. Calling bpf_obj_get_info_by_fd on the fentry program should return non-zero btf_id/btf_obj_id instead of crashing the kernel. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220804201140.1340684-2-sdf@google.com
-
Stanislav Fomichev authored
When attaching to program, the program itself might not be attached to anything (and, hence, might not have attach_btf), so we can't unconditionally use 'prog->aux->dst_prog->aux->attach_btf'. Instead, use bpf_prog_get_target_btf to pick proper target BTF: * when attached to dst_prog, use dst_prog->aux->btf * when attached to kernel btf, use prog->aux->attach_btf Fixes: b79c9fc9 ("bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220804201140.1340684-1-sdf@google.com
-
Jiri Olsa authored
The btf_sock_ids array needs struct mptcp_sock BTF ID for the bpf_skc_to_mptcp_sock helper. When CONFIG_MPTCP is disabled, the 'struct mptcp_sock' is not defined and resolve_btfids will complain with: [...] BTFIDS vmlinux WARN: resolve_btfids: unresolved symbol mptcp_sock [...] Add an empty definition for struct mptcp_sock when CONFIG_MPTCP is disabled. Fixes: 3bc253c2 ("bpf: Add bpf_skc_to_mptcp_sock_proto") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220802163324.1873044-1-jolsa@kernel.org
-
- 05 Aug, 2022 1 commit
-
-
Jiri Olsa authored
We need to release possible hash from trampoline fops object before removing it, otherwise we leak it. Fixes: 00963a2e ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20220802135651.1794015-1-jolsa@kernel.org
-
- 04 Aug, 2022 11 commits
-
-
Jinghao Jia authored
The bpf_sys_bpf() helper function allows an eBPF program to load another eBPF program from within the kernel. In this case the argument union bpf_attr pointer (as well as the insns and license pointers inside) is a kernel address instead of a userspace address (which is the case of a usual bpf() syscall). To make the memory copying process in the syscall work in both cases, bpfptr_t was introduced to wrap around the pointer and distinguish its origin. Specifically, when copying memory contents from a bpfptr_t, a copy_from_user() is performed in case of a userspace address and a memcpy() is performed for a kernel address. This can lead to problems because the in-kernel pointer is never checked for validity. The problem happens when an eBPF syscall program tries to call bpf_sys_bpf() to load a program but provides a bad insns pointer -- say 0xdeadbeef -- in the bpf_attr union. The helper calls __sys_bpf() which would then call bpf_prog_load() to load the program. bpf_prog_load() is responsible for copying the eBPF instructions to the newly allocated memory for the program; it creates a kernel bpfptr_t for insns and invokes copy_from_bpfptr(). Internally, all bpfptr_t operations are backed by the corresponding sockptr_t operations, which performs direct memcpy() on kernel pointers for copy_from/strncpy_from operations. Therefore, the code is always happy to dereference the bad pointer to trigger a un-handle-able page fault and in turn an oops. However, this is not supposed to happen because at that point the eBPF program is already verified and should not cause a memory error. Sample KASAN trace: [ 25.685056][ T228] ================================================================== [ 25.685680][ T228] BUG: KASAN: user-memory-access in copy_from_bpfptr+0x21/0x30 [ 25.686210][ T228] Read of size 80 at addr 00000000deadbeef by task poc/228 [ 25.686732][ T228] [ 25.686893][ T228] CPU: 3 PID: 228 Comm: poc Not tainted 5.19.0-rc7 #7 [ 25.687375][ T228] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS d55cb5a 04/01/2014 [ 25.687991][ T228] Call Trace: [ 25.688223][ T228] <TASK> [ 25.688429][ T228] dump_stack_lvl+0x73/0x9e [ 25.688747][ T228] print_report+0xea/0x200 [ 25.689061][ T228] ? copy_from_bpfptr+0x21/0x30 [ 25.689401][ T228] ? _printk+0x54/0x6e [ 25.689693][ T228] ? _raw_spin_lock_irqsave+0x70/0xd0 [ 25.690071][ T228] ? copy_from_bpfptr+0x21/0x30 [ 25.690412][ T228] kasan_report+0xb5/0xe0 [ 25.690716][ T228] ? copy_from_bpfptr+0x21/0x30 [ 25.691059][ T228] kasan_check_range+0x2bd/0x2e0 [ 25.691405][ T228] ? copy_from_bpfptr+0x21/0x30 [ 25.691734][ T228] memcpy+0x25/0x60 [ 25.692000][ T228] copy_from_bpfptr+0x21/0x30 [ 25.692328][ T228] bpf_prog_load+0x604/0x9e0 [ 25.692653][ T228] ? cap_capable+0xb4/0xe0 [ 25.692956][ T228] ? security_capable+0x4f/0x70 [ 25.693324][ T228] __sys_bpf+0x3af/0x580 [ 25.693635][ T228] bpf_sys_bpf+0x45/0x240 [ 25.693937][ T228] bpf_prog_f0ec79a5a3caca46_bpf_func1+0xa2/0xbd [ 25.694394][ T228] bpf_prog_run_pin_on_cpu+0x2f/0xb0 [ 25.694756][ T228] bpf_prog_test_run_syscall+0x146/0x1c0 [ 25.695144][ T228] bpf_prog_test_run+0x172/0x190 [ 25.695487][ T228] __sys_bpf+0x2c5/0x580 [ 25.695776][ T228] __x64_sys_bpf+0x3a/0x50 [ 25.696084][ T228] do_syscall_64+0x60/0x90 [ 25.696393][ T228] ? fpregs_assert_state_consistent+0x50/0x60 [ 25.696815][ T228] ? exit_to_user_mode_prepare+0x36/0xa0 [ 25.697202][ T228] ? syscall_exit_to_user_mode+0x20/0x40 [ 25.697586][ T228] ? do_syscall_64+0x6e/0x90 [ 25.697899][ T228] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 25.698312][ T228] RIP: 0033:0x7f6d543fb759 [ 25.698624][ T228] Code: 08 5b 89 e8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 a6 0e 00 f7 d8 64 89 01 48 [ 25.699946][ T228] RSP: 002b:00007ffc3df78468 EFLAGS: 00000287 ORIG_RAX: 0000000000000141 [ 25.700526][ T228] RAX: ffffffffffffffda RBX: 00007ffc3df78628 RCX: 00007f6d543fb759 [ 25.701071][ T228] RDX: 0000000000000090 RSI: 00007ffc3df78478 RDI: 000000000000000a [ 25.701636][ T228] RBP: 00007ffc3df78510 R08: 0000000000000000 R09: 0000000000300000 [ 25.702191][ T228] R10: 0000000000000005 R11: 0000000000000287 R12: 0000000000000000 [ 25.702736][ T228] R13: 00007ffc3df78638 R14: 000055a1584aca68 R15: 00007f6d5456a000 [ 25.703282][ T228] </TASK> [ 25.703490][ T228] ================================================================== [ 25.704050][ T228] Disabling lock debugging due to kernel taint Update copy_from_bpfptr() and strncpy_from_bpfptr() so that: - for a kernel pointer, it uses the safe copy_from_kernel_nofault() and strncpy_from_kernel_nofault() functions. - for a userspace pointer, it performs copy_from_user() and strncpy_from_user(). Fixes: af2ac3e1 ("bpf: Prepare bpf syscall to be used from kernel and user space.") Link: https://lore.kernel.org/bpf/20220727132905.45166-1-jinghao@linux.ibm.com/Signed-off-by: Jinghao Jia <jinghao@linux.ibm.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220729201713.88688-1-jinghao@linux.ibm.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Paul E. McKenney authored
This patch updates bpf_design_QA.rst to clarify that mentioning a function to the BTF_ID macro does not make that function become part of the Linux kernel's ABI. Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20220802173913.4170192-3-paulmck@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Paul E. McKenney authored
This patch updates bpf_design_QA.rst to clarify that the ability to attach a BPF program to an arbitrary function in the kernel does not make that function become part of the Linux kernel's ABI. [ paulmck: Apply Daniel Borkmann feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20220802173913.4170192-2-paulmck@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Paul E. McKenney authored
This patch updates bpf_design_QA.rst to clarify that the ability to attach a BPF program to a given point in the kernel code via kprobes does not make that attachment point be part of the Linux kernel's ABI. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20220802173913.4170192-1-paulmck@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yu Xiao authored
The port flag isn't set to `NFP_PORT_CHANGED` when using `ethtool -m DEVNAME` before, so the port state (e.g. interface) cannot be updated. Therefore, it caused that `ethtool -m DEVNAME` sometimes cannot read the correct information. E.g. `ethtool -m DEVNAME` cannot work when load driver before plug in optical module, as the port interface is still NONE without port update. Now update the port state before sending info to NIC to ensure that port interface is correct (latest state). Fixes: 61f7c6f4 ("nfp: implement ethtool get module EEPROM") Reviewed-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Yu Xiao <yu.xiao@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20220802093355.69065-1-simon.horman@corigine.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Florian Fainelli authored
Calling mdio_bus_phy_resume() with neither the PHY state machine set to PHY_HALTED nor phydev->mac_managed_pm set to true is a good indication that we can produce a race condition looking like this: CPU0 CPU1 bcmgenet_resume -> phy_resume -> phy_init_hw -> phy_start -> phy_resume phy_start_aneg() mdio_bus_phy_resume -> phy_resume -> phy_write(..., BMCR_RESET) -> usleep() -> phy_read() with the phy_resume() function triggering a PHY behavior that might have to be worked around with (see bf8bfc43 ("net: phy: broadcom: Fix brcm_fet_config_init()") for instance) that ultimately leads to an error reading from the PHY. Fixes: fba863b8 ("net: phy: make PHY PM ops a no-op if MAC driver manages PHY PM") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20220801233403.258871-1-f.fainelli@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Vladimir Oltean says: ==================== Make DSA work with bonding's ARP monitor Since commit 2b86cb82 ("net: dsa: declare lockless TX feature for slave ports") in v5.7, DSA breaks the ARP monitoring logic from the bonding driver, fact which was pointed out by Brian Hutchinson who uses a linux-5.10.y stable kernel. Initially I got lured by other similar hacks introduced for other NETIF_F_LLTX drivers, which, inspired by the bonding documentation, update the trans_start of their TX queues by hand. However Jakub pointed out that this simply isn't a proper solution, and after coming to think more about it, I agree, and it doesn't work properly with DSA nor is it maintainable for the future changes I plan for it (multiple DSA masters in a LAG). I've tested these changes using a DSA-based setup and a veth-based setup, using the active-backup mode and ARP monitoring, with and without arp_validate. Link to v1: https://patchwork.kernel.org/project/netdevbpf/patch/20220715232641.952532-1-vladimir.oltean@nxp.com/ Link to v2: https://patchwork.kernel.org/project/netdevbpf/patch/20220727152000.3616086-1-vladimir.oltean@nxp.com/ ==================== Link: https://lore.kernel.org/r/20220731124108.2810233-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
ARP monitoring no longer depends on dev->last_rx or dev_trans_start(), so delete this information. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
This reverts commit e66e257a. The veth driver no longer needs these hacks which are slightly detrimential to the fast path performance, because the bonding driver is keeping track of TX times of ARP and NS probes by itself, which it should. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
Now that the bonding driver keeps track of the last TX time of ARP and NS probes, we effectively revert the following commits: 32d3e51a ("net_sched: use macvlan real dev trans_start in dev_trans_start()") 07ce76aa ("net_sched: make dev_trans_start return vlan's real dev trans_start") Note that the approach of continuing to hack at this function would not get us very far, hence the desire to take a different approach. DSA is also a virtual device that uses NETIF_F_LLTX, but there, many uppers share the same lower (DSA master, i.e. the physical host port of a switch). By making dev_trans_start() on a DSA interface return the dev_trans_start() of the master, we effectively assume that all other DSA interfaces are silent, otherwise this corrupts the validity of the probe timestamp data from the bonding driver's perspective. Furthermore, the hacks didn't take into consideration the fact that the lower interface of @dev may not have been physical either. For example, VLAN over VLAN, or DSA with 2 masters in a LAG. And even furthermore, there are NETIF_F_LLTX devices which are not stacked, like veth. The hack here would not work with those, because it would not have to provide the bonding driver something to chew at all. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
The bonding driver piggybacks on time stamps kept by the network stack for the purpose of the netdev TX watchdog, and this is problematic because it does not work with NETIF_F_LLTX devices. It is hard to say why the driver looks at dev_trans_start() of the slave->dev, considering that this is updated even by non-ARP/NS probes sent by us, and even by traffic not sent by us at all (for example PTP on physical slave devices). ARP monitoring in active-backup mode appears to still work even if we track only the last TX time of actual ARP probes. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 03 Aug, 2022 18 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds authored
Pull networking changes from Paolo Abeni: "Core: - Refactor the forward memory allocation to better cope with memory pressure with many open sockets, moving from a per socket cache to a per-CPU one - Replace rwlocks with RCU for better fairness in ping, raw sockets and IP multicast router. - Network-side support for IO uring zero-copy send. - A few skb drop reason improvements, including codegen the source file with string mapping instead of using macro magic. - Rename reference tracking helpers to a more consistent netdev_* schema. - Adapt u64_stats_t type to address load/store tearing issues. - Refine debug helper usage to reduce the log noise caused by bots. BPF: - Improve socket map performance, avoiding skb cloning on read operation. - Add support for 64 bits enum, to match types exposed by kernel. - Introduce support for sleepable uprobes program. - Introduce support for enum textual representation in libbpf. - New helpers to implement synproxy with eBPF/XDP. - Improve loop performances, inlining indirect calls when possible. - Removed all the deprecated libbpf APIs. - Implement new eBPF-based LSM flavor. - Add type match support, which allow accurate queries to the eBPF used types. - A few TCP congetsion control framework usability improvements. - Add new infrastructure to manipulate CT entries via eBPF programs. - Allow for livepatch (KLP) and BPF trampolines to attach to the same kernel function. Protocols: - Introduce per network namespace lookup tables for unix sockets, increasing scalability and reducing contention. - Preparation work for Wi-Fi 7 Multi-Link Operation (MLO) support. - Add support to forciby close TIME_WAIT TCP sockets via user-space tools. - Significant performance improvement for the TLS 1.3 receive path, both for zero-copy and not-zero-copy. - Support for changing the initial MTPCP subflow priority/backup status - Introduce virtually contingus buffers for sockets over RDMA, to cope better with memory pressure. - Extend CAN ethtool support with timestamping capabilities - Refactor CAN build infrastructure to allow building only the needed features. Driver API: - Remove devlink mutex to allow parallel commands on multiple links. - Add support for pause stats in distributed switch. - Implement devlink helpers to query and flash line cards. - New helper for phy mode to register conversion. New hardware / drivers: - Ethernet DSA driver for the rockchip mt7531 on BPI-R2 Pro. - Ethernet DSA driver for the Renesas RZ/N1 A5PSW switch. - Ethernet DSA driver for the Microchip LAN937x switch. - Ethernet PHY driver for the Aquantia AQR113C EPHY. - CAN driver for the OBD-II ELM327 interface. - CAN driver for RZ/N1 SJA1000 CAN controller. - Bluetooth: Infineon CYW55572 Wi-Fi plus Bluetooth combo device. Drivers: - Intel Ethernet NICs: - i40e: add support for vlan pruning - i40e: add support for XDP framented packets - ice: improved vlan offload support - ice: add support for PPPoE offload - Mellanox Ethernet (mlx5) - refactor packet steering offload for performance and scalability - extend support for TC offload - refactor devlink code to clean-up the locking schema - support stacked vlans for bridge offloads - use TLS objects pool to improve connection rate - Netronome Ethernet NICs (nfp): - extend support for IPv6 fields mangling offload - add support for vepa mode in HW bridge - better support for virtio data path acceleration (VDPA) - enable TSO by default - Microsoft vNIC driver (mana) - add support for XDP redirect - Others Ethernet drivers: - bonding: add per-port priority support - microchip lan743x: extend phy support - Fungible funeth: support UDP segmentation offload and XDP xmit - Solarflare EF100: add support for virtual function representors - MediaTek SoC: add XDP support - Mellanox Ethernet/IB switch (mlxsw): - dropped support for unreleased H/W (XM router). - improved stats accuracy - unified bridge model coversion improving scalability (parts 1-6) - support for PTP in Spectrum-2 asics - Broadcom PHYs - add PTP support for BCM54210E - add support for the BCM53128 internal PHY - Marvell Ethernet switches (prestera): - implement support for multicast forwarding offload - Embedded Ethernet switches: - refactor OcteonTx MAC filter for better scalability - improve TC H/W offload for the Felix driver - refactor the Microchip ksz8 and ksz9477 drivers to share the probe code (parts 1, 2), add support for phylink mac configuration - Other WiFi: - Microchip wilc1000: diable WEP support and enable WPA3 - Atheros ath10k: encapsulation offload support Old code removal: - Neterion vxge ethernet driver: this is untouched since more than 10 years" * tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1890 commits) doc: sfp-phylink: Fix a broken reference wireguard: selftests: support UML wireguard: allowedips: don't corrupt stack when detecting overflow wireguard: selftests: update config fragments wireguard: ratelimiter: use hrtimer in selftest net/mlx5e: xsk: Discard unaligned XSK frames on striding RQ net: usb: ax88179_178a: Bind only to vendor-specific interface selftests: net: fix IOAM test skip return code net: usb: make USB_RTL8153_ECM non user configurable net: marvell: prestera: remove reduntant code octeontx2-pf: Reduce minimum mtu size to 60 net: devlink: Fix missing mutex_unlock() call net/tls: Remove redundant workqueue flush before destroy net: txgbe: Fix an error handling path in txgbe_probe() net: dsa: Fix spelling mistakes and cleanup code Documentation: devlink: add add devlink-selftests to the table of contents dccp: put dccp_qpolicy_full() and dccp_qpolicy_push() in the same lock net: ionic: fix error check for vlan flags in ionic_set_nic_features() net: ice: fix error NETIF_F_HW_VLAN_CTAG_FILTER check in ice_vsi_sync_fltr() nfp: flower: add support for tunnel offload without key ID ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libataLinus Torvalds authored
Pull ATA updates from Damien Le Moal: - Some code refactoring for the pata_hpt37x and pata_hpt3x2n drivers, from Sergei. - Several patches to cleanup in libata-core, libata-scsi and libata-eh code: fixes arguments and variables types, change some functions declaration to static and fix for a typo in a comment. From Sergey and Xiang. - Fix a compilation warning in the pata_macio driver, from me. - A fix for the expected number of resources in the sata_mv driver fix, from Andrew. * tag 'ata-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: ata: sata_mv: Fixes expected number of resources now IRQs are gone ata: libata-scsi: fix result type of ata_ioc32() ata: pata_macio: Fix compilation warning ata: libata-eh: fix sloppy result type of ata_internal_cmd_timeout() ata: libata-core: fix sloppy parameter type in ata_exec_internal[_sg]() ata: make ata_port::fastdrain_cnt *unsigned int* ata: libata-eh: fix sloppy result type of ata_eh_nr_in_flight() ata: libata-core: make ata_exec_internal_sg() *static* ata: make transfer mode masks *unsigned int* ata: libata-core: get rid of *else* branches in ata_id_n_sectors() ata: libata-core: fix sloppy typing in ata_id_n_sectors() ata: pata_hpt3x2n: pass base DPLL frequency to hpt3x2n_pci_clock() ata: pata_hpt37x: merge hpt374_read_freq() to hpt37x_pci_clock() ata: pata_hpt37x: factor out hpt37x_pci_clock() ata: pata_hpt37x: move claculating PCI clock from hpt37x_clock_slot() ata: libata: Fix syntax errors in comments
-
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefsLinus Torvalds authored
Pull zonefs update from Damien Le Moal: "A single change for this cycle to simplify handling of the memory page used as super block buffer during mount (from Fabio)" * tag 'zonefs-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: Call page_address() on page acquired with GFP_KERNEL flag
-
git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds authored
Pull iomap updates from Darrick Wong: "The most notable change in this first batch is that we no longer schedule pages beyond i_size for writeback, preferring instead to let truncate deal with those pages. Next week, there may be a second pull request to remove iomap_writepage from the other two filesystems (gfs2/zonefs) that use iomap for buffered IO. This follows in the same vein as the recent removal of writepage from XFS, since it hasn't been triggered in a few years; it does nothing during direct reclaim; and as far as the people who examined the patchset can tell, it's moving the codebase in the right direction. However, as it was a late addition to for-next, I'm holding off on that section for another week of testing to see if anyone can come up with a solid reason for holding off in the meantime. Summary: - Skip writeback for pages that are completely beyond EOF - Minor code cleanups" * tag 'iomap-5.20-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: dax: set did_zero to true when zeroing successfully iomap: set did_zero to true when zeroing successfully iomap: skip pages past eof in iomap_do_writepage()
-
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds authored
Pull affs fix from David Sterba: "One update to AFFS, switching away from the kmap/kmap_atomic API" * tag 'affs-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: affs: use memcpy_to_page and remove replace kmap_atomic()
-
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds authored
Pull btrfs updates from David Sterba: "This brings some long awaited changes, the send protocol bump, otherwise lots of small improvements and fixes. The main core part is reworking bio handling, cleaning up the submission and endio and improving error handling. There are some changes outside of btrfs adding helpers or updating API, listed at the end of the changelog. Features: - sysfs: - export chunk size, in debug mode add tunable for setting its size - show zoned among features (was only in debug mode) - show commit stats (number, last/max/total duration) - send protocol updated to 2 - new commands: - ability write larger data chunks than 64K - send raw compressed extents (uses the encoded data ioctls), ie. no decompression on send side, no compression needed on receive side if supported - send 'otime' (inode creation time) among other timestamps - send file attributes (a.k.a file flags and xflags) - this is first version bump, backward compatibility on send and receive side is provided - there are still some known and wanted commands that will be implemented in the near future, another version bump will be needed, however we want to minimize that to avoid causing usability issues - print checksum type and implementation at mount time - don't print some messages at mount (mentioned as people asked about it), we want to print messages namely for new features so let's make some space for that - big metadata - this has been supported for a long time and is not a feature that's worth mentioning - skinny metadata - same reason, set by default by mkfs Performance improvements: - reduced amount of reserved metadata for delayed items - when inserted items can be batched into one leaf - when deleting batched directory index items - when deleting delayed items used for deletion - overall improved count of files/sec, decreased subvolume lock contention - metadata item access bounds checker micro-optimized, with a few percent of improved runtime for metadata-heavy operations - increase direct io limit for read to 256 sectors, improved throughput by 3x on sample workload Notable fixes: - raid56 - reduce parity writes, skip sectors of stripe when there are no data updates - restore reading from on-disk data instead of using stripe cache, this reduces chances to damage correct data due to RMW cycle - refuse to replay log with unknown incompat read-only feature bit set - zoned - fix page locking when COW fails in the middle of allocation - improved tracking of active zones, ZNS drives may limit the number and there are ENOSPC errors due to that limit and not actual lack of space - adjust maximum extent size for zone append so it does not cause late ENOSPC due to underreservation - mirror reading error messages show the mirror number - don't fallback to buffered IO for NOWAIT direct IO writes, we don't have the NOWAIT semantics for buffered io yet - send, fix sending link commands for existing file paths when there are deleted and created hardlinks for same files - repair all mirrors for profiles with more than 1 copy (raid1c34) - fix repair of compressed extents, unify where error detection and repair happen Core changes: - bio completion cleanups - don't double defer compression bios - simplify endio workqueues - add more data to btrfs_bio to avoid allocation for read requests - rework bio error handling so it's same what block layer does, the submission works and errors are consumed in endio - when asynchronous bio offload fails fall back to synchronous checksum calculation to avoid errors under writeback or memory pressure - new trace points - raid56 events - ordered extent operations - super block log_root_transid deprecated (never used) - mixed_backref and big_metadata sysfs feature files removed, they've been default for sufficiently long time, there are no known users and mixed_backref could be confused with mixed_groups Non-btrfs changes, API updates: - minor highmem API update to cover const arguments - switch all kmap/kmap_atomic to kmap_local - remove redundant flush_dcache_page() - address_space_operations::writepage callback removed - add bdev_max_segments() helper" * tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (163 commits) btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read btrfs: fix repair of compressed extents btrfs: remove the start argument to check_data_csum and export btrfs: pass a btrfs_bio to btrfs_repair_one_sector btrfs: simplify the pending I/O counting in struct compressed_bio btrfs: repair all known bad mirrors btrfs: merge btrfs_dev_stat_print_on_error with its only caller btrfs: join running log transaction when logging new name btrfs: simplify error handling in btrfs_lookup_dentry btrfs: send: always use the rbtree based inode ref management infrastructure btrfs: send: fix sending link commands for existing file paths btrfs: send: introduce recorded_ref_alloc and recorded_ref_free btrfs: zoned: wait until zone is finished when allocation didn't progress btrfs: zoned: write out partially allocated region btrfs: zoned: activate necessary block group btrfs: zoned: activate metadata block group on flush_space btrfs: zoned: disable metadata overcommit for zoned btrfs: zoned: introduce space_info->active_total_bytes btrfs: zoned: finish least available block group on data bg allocation btrfs: let can_allocate_chunk return error ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efiLinus Torvalds authored
Pull efivars sysfs interface removal from Ard Biesheuvel: "Remove the obsolete 'efivars' sysfs based interface to the EFI variable store, now that all users have moved to the efivarfs pseudo file system, which was created ~10 years ago to address some fundamental shortcomings in the sysfs based driver. Move the 'business logic' related to which EFI variables are important and may affect the boot flow from the efivars support layer into the efivarfs pseudo file system, so it is no longer exposed to other parts of the kernel" * tag 'efi-efivars-removal-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi: vars: Move efivar caching layer into efivarfs efi: vars: Switch to new wrapper layer efi: vars: Remove deprecated 'efivars' sysfs interface
-
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efiLinus Torvalds authored
Pull EFI updates from Ard Biesheuvel: - Enable mirrored memory for arm64 - Fix up several abuses of the efivar API - Refactor the efivar API in preparation for moving the 'business logic' part of it into efivarfs - Enable ACPI PRM on arm64 * tag 'efi-next-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (24 commits) ACPI: Move PRM config option under the main ACPI config ACPI: Enable Platform Runtime Mechanism(PRM) support on ARM64 ACPI: PRM: Change handler_addr type to void pointer efi: Simplify arch_efi_call_virt() macro drivers: fix typo in firmware/efi/memmap.c efi: vars: Drop __efivar_entry_iter() helper which is no longer used efi: vars: Use locking version to iterate over efivars linked lists efi: pstore: Omit efivars caching EFI varstore access layer efi: vars: Add thin wrapper around EFI get/set variable interface efi: vars: Don't drop lock in the middle of efivar_init() pstore: Add priv field to pstore_record for backend specific use Input: applespi - avoid efivars API and invoke EFI services directly selftests/kexec: remove broken EFI_VARS secure boot fallback check brcmfmac: Switch to appropriate helper to load EFI variable contents iwlwifi: Switch to proper EFI variable store interface media: atomisp_gmin_platform: stop abusing efivar API efi: efibc: avoid efivar API for setting variables efi: avoid efivars layer when loading SSDTs from variables efi: Correct comment on efi_memmap_alloc memblock: Disable mirror feature if kernelcore is not specified ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds authored
Pull 9p iov_iter fix from Al Viro: "net/9p abuses iov_iter primitives - it attempts to copy _from_ a destination-only iov_iter when it handles Rerror arriving in reply to zero-copy request. Not hard to fix, fortunately. This is a prereq for the iov_iter_get_pages() work in the second part of iov_iter series, ended up in a separate branch" * tag 'pull-work.9p' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: 9p: handling Rerror without copy_from_iter_full()
-
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds authored
Pull copy_to_iter_mc fix from Al Viro: "Backportable fix for copy_to_iter_mc() - the second part of iov_iter work will pretty much overwrite this, but would be much harder to backport" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix short copy handling in copy_mc_pipe_to_iter()
-
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds authored
Pull vfs iov_iter updates from Al Viro: "Part 1 - isolated cleanups and optimizations. One of the goals is to reduce the overhead of using ->read_iter() and ->write_iter() instead of ->read()/->write(). new_sync_{read,write}() has a surprising amount of overhead, in particular inside iocb_flags(). That's the explanation for the beginning of the series is in this pile; it's not directly iov_iter-related, but it's a part of the same work..." * tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: first_iovec_segment(): just return address iov_iter: massage calling conventions for first_{iovec,bvec}_segment() iov_iter: first_{iovec,bvec}_segment() - simplify a bit iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT iov_iter_bvec_advance(): don't bother with bvec_iter copy_page_{to,from}_iter(): switch iovec variants to generic keep iocb_flags() result cached in struct file iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC struct file: use anonymous union member for rcuhead and llist btrfs: use IOMAP_DIO_NOSYNC teach iomap_dio_rw() to suppress dsync No need of likely/unlikely on calls of check_copy_size()
-
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds authored
Pull vfs dcache updates from Al Viro: "The main part here is making parallel lookups safe for RT - making sure preemption is disabled in start_dir_add()/ end_dir_add() sections (on non-RT it's automatic, on RT it needs to to be done explicitly) and moving wakeups from __d_lookup_done() inside of such to the end of those sections. Wakeups can be safely delayed for as long as ->d_lock on in-lookup dentry is held; proving that has caught a bug in d_add_ci() that allows memory corruption when sufficiently bogus ntfs (or case-insensitive xfs) image is mounted. Easily fixed, fortunately" * tag 'pull-work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs/dcache: Move wakeup out of i_seq_dir write held region. fs/dcache: Move the wakeup from __d_lookup_done() to the caller. fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT d_add_ci(): make sure we don't miss d_lookup_done()
-
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds authored
Pull vfs lseek updates from Al Viro: "Jason's lseek series. Saner handling of 'lseek should fail with ESPIPE' - this gets rid of the magical no_llseek thing and makes checks consistent. In particular, the ad-hoc "can we do splice via internal pipe" checks got saner (and somewhat more permissive, which is what Jason had been after, AFAICT)" * tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove no_llseek fs: check FMODE_LSEEK to control internal pipe splicing vfio: do not set FMODE_LSEEK flag dma-buf: remove useless FMODE_LSEEK flag fs: do not compare against ->llseek fs: clear or set FMODE_LSEEK based on llseek function
-
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds authored
Pull vfs namei updates from Al Viro: "RCU pathwalk cleanups. Storing sampled ->d_seq of the next dentry in nameidata simplifies life considerably, especially if we delay fetching ->d_inode until step_into()" * tag 'pull-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: step_into(): move fetching ->d_inode past handle_mounts() lookup_fast(): don't bother with inode follow_dotdot{,_rcu}(): don't bother with inode step_into(): lose inode argument namei: stash the sampled ->d_seq into nameidata namei: move clearing LOOKUP_RCU towards rcu_read_unlock() switch try_to_unlazy_next() to __legitimize_mnt() follow_dotdot{,_rcu}(): change calling conventions namei: get rid of pointless unlikely(read_seqcount_retry(...)) __follow_mount_rcu(): verify that mount_lock remains unchanged
-
git://git.infradead.org/users/willy/pagecacheLinus Torvalds authored
Pull folio updates from Matthew Wilcox: - Fix an accounting bug that made NR_FILE_DIRTY grow without limit when running xfstests - Convert more of mpage to use folios - Remove add_to_page_cache() and add_to_page_cache_locked() - Convert find_get_pages_range() to filemap_get_folios() - Improvements to the read_cache_page() family of functions - Remove a few unnecessary checks of PageError - Some straightforward filesystem conversions to use folios - Split PageMovable users out from address_space_operations into their own movable_operations - Convert aops->migratepage to aops->migrate_folio - Remove nobh support (Christoph Hellwig) * tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits) fs: remove the NULL get_block case in mpage_writepages fs: don't call ->writepage from __mpage_writepage fs: remove the nobh helpers jfs: stop using the nobh helper ext2: remove nobh support ntfs3: refactor ntfs_writepages mm/folio-compat: Remove migration compatibility functions fs: Remove aops->migratepage() secretmem: Convert to migrate_folio hugetlb: Convert to migrate_folio aio: Convert to migrate_folio f2fs: Convert to filemap_migrate_folio() ubifs: Convert to filemap_migrate_folio() btrfs: Convert btrfs_migratepage to migrate_folio mm/migrate: Add filemap_migrate_folio() mm/migrate: Convert migrate_page() to migrate_folio() nfs: Convert to migrate_folio btrfs: Convert btree_migratepage to migrate_folio mm/migrate: Convert expected_page_refs() to folio_expected_refs() mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio() ...
-
git://git.infradead.org/users/willy/xarrayLinus Torvalds authored
Pull XArray/IDR updates from Matthew Wilcox: - Add appropriate might_alloc() annotations to the XArray APIs - Document that the IDR is deprecated * tag 'xarray-6.0' of git://git.infradead.org/users/willy/xarray: IDR: Note that the IDR API is deprecated XArray: Add calls to might_alloc()
-
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds authored
Pull cgroup updates from Tejun Heo: "Several core optimizations: - threadgroup_rwsem write locking is skipped when configuring controllers in empty subtrees. Combined with CLONE_INTO_CGROUP, this allows the common static usage pattern to not grab threadgroup_rwsem at all (glibc still doesn't seem ready for CLONE_INTO_CGROUP unfortunately). - threadgroup_rwsem used to be put into non-percpu mode by default due to latency concerns in specific use cases. There's no reason for everyone else to pay for it. Make the behavior optional. - psi no longer allocates memory when disabled. ... along with some code cleanups" * tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Skip subtree root in cgroup_update_dfl_csses() cgroup: remove "no" prefixed mount options cgroup: Make !percpu threadgroup_rwsem operations optional cgroup: Add "no" prefixed mount options cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes psi: dont alloc memory for psi by default
-
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni authored
Conflicts: net/ax25/af_ax25.c d7c4c9e0 ("ax25: fix incorrect dev_tracker usage") d62607c3 ("net: rename reference+tracking helpers") drivers/net/netdevsim/fib.c 180a6a3e ("netdevsim: fib: Fix reference count leak on route deletion failure") 012ec02a ("netdevsim: convert driver to use unlocked devlink API during init/fini") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-