- 25 Apr, 2020 10 commits
-
-
Andrii Nakryiko authored
Reorder include paths to ensure that runqslower sources are picking up vmlinux.h, generated by runqslower's own Makefile. When runqslower is built from selftests/bpf, due to current -I$(BPF_INCLUDE) -I$(OUTPUT) ordering, it might pick up not-yet-complete vmlinux.h, generated by selftests Makefile, which could lead to compilation errors like [0]. So ensure that -I$(OUTPUT) goes first and rely on runqslower's Makefile own dependency chain to ensure vmlinux.h is properly completed before source code relying on it is compiled. [0] https://travis-ci.org/github/libbpf/libbpf/jobs/677905925Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200422012407.176303-1-andriin@fb.com
-
Zou Wei authored
Fix the following sparse warning: kernel/bpf/syscall.c:2289:30: warning: symbol 'bpf_link_fops' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zou Wei <zou_wei@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/1587609160-117806-1-git-send-email-zou_wei@huawei.com
-
Martin KaFai Lau authored
In the prog cmd, the "-d" option turns on the verifier log. This is missed in the "struct_ops" cmd and this patch fixes it. Fixes: 65c93628 ("bpftool: Add struct_ops support") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200424182911.1259355-1-kafai@fb.com
-
Toke Høiland-Jørgensen authored
This adds a new selftest that tests the ability to attach an freplace program to a program type that relies on the expected_attach_type of the target program to pass verification. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158773526831.293902.16011743438619684815.stgit@toke.dk
-
Toke Høiland-Jørgensen authored
For some program types, the verifier relies on the expected_attach_type of the program being verified in the verification process. However, for freplace programs, the attach type was not propagated along with the verifier ops, so the expected_attach_type would always be zero for freplace programs. This in turn caused the verifier to sometimes make the wrong call for freplace programs. For all existing uses of expected_attach_type for this purpose, the result of this was only false negatives (i.e., freplace functions would be rejected by the verifier even though they were valid programs for the target they were replacing). However, should a false positive be introduced, this can lead to out-of-bounds accesses and/or crashes. The fix introduced in this patch is to propagate the expected_attach_type to the freplace program during verification, and reset it after that is done. Fixes: be8704ff ("bpf: Introduce dynamic program extensions") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158773526726.293902.13257293296560360508.stgit@toke.dk
-
Andrii Nakryiko authored
Fix bug of not putting bpf_link in LINK_UPDATE command. Also enforce zeroed old_prog_fd if no BPF_F_REPLACE flag is specified. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200424052045.4002963-1-andriin@fb.com
-
Wang YanQing authored
When verifier_zext is true, we don't need to emit code for zero-extension. Fixes: 836256bf ("x32: bpf: eliminate zero extension code-gen") Signed-off-by: Wang YanQing <udknight@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200423050637.GA4029@udknight
-
Luke Nelson authored
The current JIT clobbers the destination register for BPF_JSET BPF_X and BPF_K by using "and" and "or" instructions. This is fine when the destination register is a temporary loaded from a register stored on the stack but not otherwise. This patch fixes the problem (for both BPF_K and BPF_X) by always loading the destination register into temporaries since BPF_JSET should not modify the destination register. This bug may not be currently triggerable as BPF_REG_AX is the only register not stored on the stack and the verifier uses it in a limited way. Fixes: 03f5781b ("bpf, x86_32: add eBPF JIT compiler for ia32") Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Wang YanQing <udknight@gmail.com> Link: https://lore.kernel.org/bpf/20200422173630.8351-2-luke.r.nels@gmail.com
-
Luke Nelson authored
The current JIT uses the following sequence to zero-extend into the upper 32 bits of the destination register for BPF_LDX BPF_{B,H,W}, when the destination register is not on the stack: EMIT3(0xC7, add_1reg(0xC0, dst_hi), 0); The problem is that C7 /0 encodes a MOV instruction that requires a 4-byte immediate; the current code emits only 1 byte of the immediate. This means that the first 3 bytes of the next instruction will be treated as the rest of the immediate, breaking the stream of instructions. This patch fixes the problem by instead emitting "xor dst_hi,dst_hi" to clear the upper 32 bits. This fixes the problem and is more efficient than using MOV to load a zero immediate. This bug may not be currently triggerable as BPF_REG_AX is the only register not stored on the stack and the verifier uses it in a limited way, and the verifier implements a zero-extension optimization. But the JIT should avoid emitting incorrect encodings regardless. Fixes: 03f5781b ("bpf, x86_32: add eBPF JIT compiler for ia32") Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Acked-by: Wang YanQing <udknight@gmail.com> Link: https://lore.kernel.org/bpf/20200422173630.8351-1-luke.r.nels@gmail.com
-
Jakub Wilk authored
The patch fixes: $ scripts/bpf_helpers_doc.py > bpf-helpers.rst $ rst2man bpf-helpers.rst > bpf-helpers.7 bpf-helpers.rst:1105: (WARNING/2) Inline strong start-string without end-string. Signed-off-by: Jakub Wilk <jwilk@jwilk.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200422082324.2030-1-jwilk@jwilk.net
-
- 23 Apr, 2020 1 commit
-
-
David Ahern authored
The commit in the Fixes tag changed get_xdp_id to only return prog_id if flags is 0, but there are other XDP flags than the modes - e.g., XDP_FLAGS_UPDATE_IF_NOEXIST. Since the intention was only to look at MODE flags, clear other ones before checking if flags is 0. Fixes: f07cbad2 ("libbpf: Fix bpf_get_link_xdp_id flags handling") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrey Ignatov <rdna@fb.com>
-
- 21 Apr, 2020 5 commits
-
-
Luke Nelson authored
This patch adds a test to test_verifier that writes the lower 8 bits of R10 (aka FP) using BPF_B to an array map and reads the result back. The expected behavior is that the result should be the same as first copying R10 to R9, and then storing / loading the lower 8 bits of R9. This test catches a bug that was present in the x86-64 JIT that caused an incorrect encoding for BPF_STX BPF_B when the source operand is R10. Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200418232655.23870-2-luke.r.nels@gmail.com
-
Luke Nelson authored
This patch fixes an encoding bug in emit_stx for BPF_B when the source register is BPF_REG_FP. The current implementation for BPF_STX BPF_B in emit_stx saves one REX byte when the operands can be encoded using Mod-R/M alone. The lower 8 bits of registers %rax, %rbx, %rcx, and %rdx can be accessed without using a REX prefix via %al, %bl, %cl, and %dl, respectively. Other registers, (e.g., %rsi, %rdi, %rbp, %rsp) require a REX prefix to use their 8-bit equivalents (%sil, %dil, %bpl, %spl). The current code checks if the source for BPF_STX BPF_B is BPF_REG_1 or BPF_REG_2 (which map to %rdi and %rsi), in which case it emits the required REX prefix. However, it misses the case when the source is BPF_REG_FP (mapped to %rbp). The result is that BPF_STX BPF_B with BPF_REG_FP as the source operand will read from register %ch instead of the correct %bpl. This patch fixes the problem by fixing and refactoring the check on which registers need the extra REX byte. Since no BPF registers map to %rsp, there is no need to handle %spl. Fixes: 62258278 ("net: filter: x86: internal BPF JIT") Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200418232655.23870-1-luke.r.nels@gmail.com
-
Jann Horn authored
check_xadd() can cause check_ptr_to_btf_access() to be executed with atype==BPF_READ and value_regno==-1 (meaning "just check whether the access is okay, don't tell me what type it will result in"). Handle that case properly and skip writing type information, instead of indexing into the registers at index -1 and writing into out-of-bounds memory. Note that at least at the moment, you can't actually write through a BTF pointer, so check_xadd() will reject the program after calling check_ptr_to_btf_access with atype==BPF_WRITE; but that's after the verifier has already corrupted memory. This patch assumes that BTF pointers are not available in unprivileged programs. Fixes: 9e15db66 ("bpf: Implement accurate raw_tp context access via BTF") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200417000007.10734-2-jannh@google.com
-
Jann Horn authored
When check_xadd() verifies an XADD operation on a pointer to a stack slot containing a spilled pointer, check_stack_read() verifies that the read, which is part of XADD, is valid. However, since the placeholder value -1 is passed as `value_regno`, check_stack_read() can only return a binary decision and can't return the type of the value that was read. The intent here is to verify whether the value read from the stack slot may be used as a SCALAR_VALUE; but since check_stack_read() doesn't check the type, and the type information is lost when check_stack_read() returns, this is not enforced, and a malicious user can abuse XADD to leak spilled kernel pointers. Fix it by letting check_stack_read() verify that the value is usable as a SCALAR_VALUE if no type information is passed to the caller. To be able to use __is_pointer_value() in check_stack_read(), move it up. Fix up the expected unprivileged error message for a BPF selftest that, until now, assumed that unprivileged users can use XADD on stack-spilled pointers. This also gives us a test for the behavior introduced in this patch for free. In theory, this could also be fixed by forbidding XADD on stack spills entirely, since XADD is a locked operation (for operations on memory with concurrency) and there can't be any concurrency on the BPF stack; but Alexei has said that he wants to keep XADD on stack slots working to avoid changes to the test suite [1]. The following BPF program demonstrates how to leak a BPF map pointer as an unprivileged user using this bug: // r7 = map_pointer BPF_LD_MAP_FD(BPF_REG_7, small_map), // r8 = launder(map_pointer) BPF_STX_MEM(BPF_DW, BPF_REG_FP, BPF_REG_7, -8), BPF_MOV64_IMM(BPF_REG_1, 0), ((struct bpf_insn) { .code = BPF_STX | BPF_DW | BPF_XADD, .dst_reg = BPF_REG_FP, .src_reg = BPF_REG_1, .off = -8 }), BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_FP, -8), // store r8 into map BPF_MOV64_REG(BPF_REG_ARG1, BPF_REG_7), BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP), BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4), BPF_ST_MEM(BPF_W, BPF_REG_ARG2, 0, 0), BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem), BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), BPF_EXIT_INSN(), BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_8, 0), BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN() [1] https://lore.kernel.org/bpf/20200416211116.qxqcza5vo2ddnkdq@ast-mbp.dhcp.thefacebook.com/ Fixes: 17a52670 ("bpf: verifier (add verifier core)") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200417000007.10734-1-jannh@google.com
-
Toke Høiland-Jørgensen authored
When the kernel is built with CONFIG_DEBUG_PER_CPU_MAPS, the cpumap code can trigger a spurious warning if CONFIG_CPUMASK_OFFSTACK is also set. This happens because in this configuration, NR_CPUS can be larger than nr_cpumask_bits, so the initial check in cpu_map_alloc() is not sufficient to guard against hitting the warning in cpumask_check(). Fix this by explicitly checking the supplied key against the nr_cpumask_bits variable before calling cpu_possible(). Fixes: 6710e112 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Xiumei Mu <xmu@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200416083120.453718-1-toke@redhat.com
-
- 20 Apr, 2020 16 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linuxDavid S. Miller authored
mlx5-fixes-2020-04-20 Signed-off-by: David S. Miller <davem@davemloft.net>
-
Zhu Yanjun authored
In the switchdev mode, when running "cat /sys/class/net/NIC/statistics/tx_packets", the ppcnt register is accessed to get the latest values. But currently this command can not get the correct values from ppcnt. From firmware manual, before getting the 802_3 counters, the 802_3 data layout should be set to the ppcnt register. When the command "cat /sys/class/net/NIC/statistics/tx_packets" is run, before updating 802_3 data layout with ppcnt register, the monitor counters are tested. The test result will decide the 802_3 data layout is updated or not. Actually the monitor counters do not support to monitor rx/tx stats of 802_3 in switchdev mode. So the rx/tx counters change will not trigger monitor counters. So the 802_3 data layout will not be updated in ppcnt register. Finally this command can not get the latest values from ppcnt register with 802_3 data layout. Fixes: 5c7e8bbb ("net/mlx5e: Use monitor counters for update stats") Signed-off-by: Zhu Yanjun <yanjunz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Saeed Mahameed authored
MLX5_CORE uses the 'imply' keyword to depend on VXLAN, PTP_1588_CLOCK, MLXFW and PCI_HYPERV_INTERFACE. This was useful to force vxlan, ptp, etc.. to be reachable to mlx5 regardless of their config states. Due to the changes in the cited commit below, the semantics of 'imply' was changed to not force any restriction on the implied config. As a result of this change, the compilation of MLX5_CORE=y and VXLAN=m would result in undefined references, as VXLAN now would stay as 'm'. To fix this we change MLX5_CORE to have a weak dependency on these modules/configs and make sure they are reachable, by adding: depend on symbol || !symbol. For example: VXLAN=m MLX5_CORE=y, this will force MLX5_CORE to m Fixes: def2fbff ("kconfig: allow symbols implied by y to become m") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nicolas Pitre <nico@fluxnic.net> Cc: Arnd Bergmann <arnd@arndb.de> Reported-by: Randy Dunlap <rdunlap@infradead.org>
-
Maxim Mikityanskiy authored
XSK wakeup function triggers NAPI by posting a NOP WQE to a special XSK ICOSQ. When the application floods the driver with wakeup requests by calling sendto() in a certain pattern that ends up in mlx5e_trigger_irq, the XSK ICOSQ may overflow. Multiple NOPs are not required and won't accelerate the process, so avoid posting a second NOP if there is one already on the way. This way we also avoid increasing the queue size (which might not help anyway). Fixes: db05815b ("net/mlx5e: Add XSK zero-copy support") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Paul Blakey authored
After allowing parallel tuple insertion, we get the following trace: [ 5505.142249] ------------[ cut here ]------------ [ 5505.148155] WARNING: CPU: 21 PID: 13313 at lib/radix-tree.c:581 delete_node+0x16c/0x180 [ 5505.295553] CPU: 21 PID: 13313 Comm: kworker/u50:22 Tainted: G OE 5.6.0+ #78 [ 5505.304824] Hardware name: Supermicro Super Server/X10DRT-P, BIOS 2.0b 03/30/2017 [ 5505.313740] Workqueue: nf_flow_table_offload flow_offload_work_handler [nf_flow_table] [ 5505.323257] RIP: 0010:delete_node+0x16c/0x180 [ 5505.349862] RSP: 0018:ffffb19184eb7b30 EFLAGS: 00010282 [ 5505.356785] RAX: 0000000000000000 RBX: ffff904ac95b86d8 RCX: ffff904b6f938838 [ 5505.365190] RDX: 0000000000000000 RSI: ffff904ac954b908 RDI: ffff904ac954b920 [ 5505.373628] RBP: ffff904b4ac13060 R08: 0000000000000001 R09: 0000000000000000 [ 5505.382155] R10: 0000000000000000 R11: 0000000000000040 R12: 0000000000000000 [ 5505.390527] R13: ffffb19184eb7bfc R14: ffff904b6bef5800 R15: ffff90482c1203c0 [ 5505.399246] FS: 0000000000000000(0000) GS:ffff904c2fc80000(0000) knlGS:0000000000000000 [ 5505.408621] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5505.415739] CR2: 00007f5d27006010 CR3: 0000000058c10006 CR4: 00000000001626e0 [ 5505.424547] Call Trace: [ 5505.428429] idr_alloc_u32+0x7b/0xc0 [ 5505.433803] mlx5_tc_ct_entry_add_rule+0xbf/0x950 [mlx5_core] [ 5505.441354] ? mlx5_fc_create+0x23c/0x370 [mlx5_core] [ 5505.448225] mlx5_tc_ct_block_flow_offload+0x874/0x10b0 [mlx5_core] [ 5505.456278] ? mlx5_tc_ct_block_flow_offload+0x63d/0x10b0 [mlx5_core] [ 5505.464532] nf_flow_offload_tuple.isra.21+0xc5/0x140 [nf_flow_table] [ 5505.472286] ? __kmalloc+0x217/0x2f0 [ 5505.477093] ? flow_rule_alloc+0x1c/0x30 [ 5505.482117] flow_offload_work_handler+0x1d0/0x290 [nf_flow_table] [ 5505.489674] ? process_one_work+0x17c/0x580 [ 5505.494922] process_one_work+0x202/0x580 [ 5505.500082] ? process_one_work+0x17c/0x580 [ 5505.505696] worker_thread+0x4c/0x3f0 [ 5505.510458] kthread+0x103/0x140 [ 5505.514989] ? process_one_work+0x580/0x580 [ 5505.520616] ? kthread_bind+0x10/0x10 [ 5505.525837] ret_from_fork+0x3a/0x50 [ 5505.570841] ---[ end trace 07995de9c56d6831 ]--- This happens from parallel deletes/adds to idr, as idr isn't protected. Fix that by using xarray as the tuple_ids allocator instead of idr. Fixes: 7da182a9 ("netfilter: flowtable: Use work entry per offload command") Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Niklas Schnelle authored
On s390 FORCE_MAX_ZONEORDER is 9 instead of 11, thus a larger kzalloc() allocation as done for the firmware tracer will always fail. Looking at mlx5_fw_tracer_save_trace(), it is actually the driver itself that copies the debug data into the trace array and there is no need for the allocation to be contiguous in physical memory. We can therefor use kvzalloc() instead of kzalloc() and get rid of the large contiguous allcoation. Fixes: f53aaa31 ("net/mlx5: FW tracer, implement tracer logic") Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Taehee Yoo authored
When team mode is changed or set, the team_mode_get() is called to check whether the mode module is inserted or not. If the mode module is not inserted, it calls the request_module(). In the request_module(), it creates a child process, which is the "modprobe" process and waits for the done of the child process. At this point, the following locks were used. down_read(&cb_lock()); by genl_rcv() genl_lock(); by genl_rcv_msc() rtnl_lock(); by team_nl_cmd_options_set() mutex_lock(&team->lock); by team_nl_team_get() Concurrently, the team module could be removed by rmmod or "modprobe -r" The __exit function of team module is team_module_exit(), which calls team_nl_fini() and it tries to acquire following locks. down_write(&cb_lock); genl_lock(); Because of the genl_lock() and cb_lock, this process can't be finished earlier than request_module() routine. The problem secenario. CPU0 CPU1 team_mode_get request_module() modprobe -r team_mode_roundrobin team <--(B) modprobe team <--(A) team_mode_roundrobin By request_module(), the "modprobe team_mode_roundrobin" command will be executed. At this point, the modprobe process will decide that the team module should be inserted before team_mode_roundrobin. Because the team module is being removed. By the module infrastructure, the same module insert/remove operations can't be executed concurrently. So, (A) waits for (B) but (B) also waits for (A) because of locks. So that the hang occurs at this point. Test commands: while : do teamd -d & killall teamd & modprobe -rv team_mode_roundrobin & done The approach of this patch is to hold the reference count of the team module if the team module is compiled as a module. If the reference count of the team module is not zero while request_module() is being called, the team module will not be removed at that moment. So that the above scenario could not occur. Fixes: 3d249d4c ("net: introduce ethernet teaming device") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Paolo Abeni says: ==================== mptcp: fix races on accept() This series includes some fixes for accept() races which may cause inconsistent MPTCP socket status and oops. Please see the individual patches for the technical details. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Paolo Abeni authored
We don't need them, as we can use the current ingress opt data instead. Setting them in syn_recv_sock() may causes inconsistent mptcp socket status, as per previous commit. Fixes: cc7972ea ("mptcp: parse and emit MP_CAPABLE option according to v1 spec") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Paolo Abeni authored
If multiple CPUs races on the same req_sock in syn_recv_sock(), flipping such field can cause inconsistent child socket status. When racing, the CPU losing the req ownership may still change the mptcp request socket mp_capable flag while the CPU owning the request is cloning the socket, leaving the child socket with 'is_mptcp' set but no 'mp_capable' flag. Such socket will stay with 'conn' field cleared, heading to oops in later mptcp callback. Address the issue tracking the fallback status in a local variable. Fixes: 58b09919 ("mptcp: create msk early") Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
Following splat can occur during self test: BUG: KASAN: use-after-free in subflow_data_ready+0x156/0x160 Read of size 8 at addr ffff888100c35c28 by task mptcp_connect/4808 subflow_data_ready+0x156/0x160 tcp_child_process+0x6a3/0xb30 tcp_v4_rcv+0x2231/0x3730 ip_protocol_deliver_rcu+0x5c/0x860 ip_local_deliver_finish+0x220/0x360 ip_local_deliver+0x1c8/0x4e0 ip_rcv_finish+0x1da/0x2f0 ip_rcv+0xd0/0x3c0 __netif_receive_skb_one_core+0xf5/0x160 __netif_receive_skb+0x27/0x1c0 process_backlog+0x21e/0x780 net_rx_action+0x35f/0xe90 do_softirq+0x4c/0x50 [..] This occurs when accessing subflow_ctx->conn. Problem is that tcp_child_process() calls listen sockets' sk_data_ready() notification, but it doesn't hold the listener lock. Another cpu calling close() on the listener will then cause transition of refcount to 0. Fixes: 58b09919 ("mptcp: create msk early") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Rahul Lakkireddy authored
Fetching PTP sync information from mailbox is slow and can take up to 10 milliseconds. Reduce this unnecessary delay by directly reading the information from the corresponding registers. Fixes: 9c33e420 ("cxgb4: Add PTP Hardware Clock (PHC) support") Signed-off-by: Manoj Malviya <manojmalviya@chelsio.com> Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Marc Zyngier authored
Running with KASAN on a VIM3L systems leads to the following splat when probing the Ethernet device: ================================================================== BUG: KASAN: global-out-of-bounds in _get_maxdiv+0x74/0xd8 Read of size 4 at addr ffffa000090615f4 by task systemd-udevd/139 CPU: 1 PID: 139 Comm: systemd-udevd Tainted: G E 5.7.0-rc1-00101-g8624b7577b9c #781 Hardware name: amlogic w400/w400, BIOS 2020.01-rc5 03/12/2020 Call trace: dump_backtrace+0x0/0x2a0 show_stack+0x20/0x30 dump_stack+0xec/0x148 print_address_description.isra.12+0x70/0x35c __kasan_report+0xfc/0x1d4 kasan_report+0x4c/0x68 __asan_load4+0x9c/0xd8 _get_maxdiv+0x74/0xd8 clk_divider_bestdiv+0x74/0x5e0 clk_divider_round_rate+0x80/0x1a8 clk_core_determine_round_nolock.part.9+0x9c/0xd0 clk_core_round_rate_nolock+0xf0/0x108 clk_hw_round_rate+0xac/0xf0 clk_factor_round_rate+0xb8/0xd0 clk_core_determine_round_nolock.part.9+0x9c/0xd0 clk_core_round_rate_nolock+0xf0/0x108 clk_core_round_rate_nolock+0xbc/0x108 clk_core_set_rate_nolock+0xc4/0x2e8 clk_set_rate+0x58/0xe0 meson8b_dwmac_probe+0x588/0x72c [dwmac_meson8b] platform_drv_probe+0x78/0xd8 really_probe+0x158/0x610 driver_probe_device+0x140/0x1b0 device_driver_attach+0xa4/0xb0 __driver_attach+0xcc/0x1c8 bus_for_each_dev+0xf4/0x168 driver_attach+0x3c/0x50 bus_add_driver+0x238/0x2e8 driver_register+0xc8/0x1e8 __platform_driver_register+0x88/0x98 meson8b_dwmac_driver_init+0x28/0x1000 [dwmac_meson8b] do_one_initcall+0xa8/0x328 do_init_module+0xe8/0x368 load_module+0x3300/0x36b0 __do_sys_finit_module+0x120/0x1a8 __arm64_sys_finit_module+0x4c/0x60 el0_svc_common.constprop.2+0xe4/0x268 do_el0_svc+0x98/0xa8 el0_svc+0x24/0x68 el0_sync_handler+0x12c/0x318 el0_sync+0x158/0x180 The buggy address belongs to the variable: div_table.63646+0x34/0xfffffffffffffa40 [dwmac_meson8b] Memory state around the buggy address: ffffa00009061480: fa fa fa fa 00 00 00 01 fa fa fa fa 00 00 00 00 ffffa00009061500: 05 fa fa fa fa fa fa fa 00 04 fa fa fa fa fa fa >ffffa00009061580: 00 03 fa fa fa fa fa fa 00 00 00 00 00 00 fa fa ^ ffffa00009061600: fa fa fa fa 00 01 fa fa fa fa fa fa 01 fa fa fa ffffa00009061680: fa fa fa fa 00 01 fa fa fa fa fa fa 04 fa fa fa ================================================================== Digging into this indeed shows that the clock divider array is lacking a final fence, and that the clock subsystems goes in the weeds. Oh well. Let's add the empty structure that indicates the end of the array. Fixes: bd6f4854 ("net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs") Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
John Haxby authored
Commit b6f61189 ("ipv6: restrict IPV6_ADDRFORM operation") fixed a problem found by syzbot an unfortunate logic error meant that it also broke IPV6_ADDRFORM. Rearrange the checks so that the earlier test is just one of the series of checks made before moving the socket from IPv6 to IPv4. Fixes: b6f61189 ("ipv6: restrict IPV6_ADDRFORM operation") Signed-off-by: John Haxby <john.haxby@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tang Bin authored
In the function bcm_sysport_probe(), when get irq failed, the function platform_get_irq() logs an error message, so remove redundant message here. Signed-off-by: Tang Bin <tangbin@cmss.chinamobile.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tonghao Zhang authored
syzbot wrote: | ============================= | WARNING: suspicious RCU usage | 5.7.0-rc1+ #45 Not tainted | ----------------------------- | net/openvswitch/conntrack.c:1898 RCU-list traversed in non-reader section!! | | other info that might help us debug this: | rcu_scheduler_active = 2, debug_locks = 1 | ... | | stack backtrace: | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 | Workqueue: netns cleanup_net | Call Trace: | ... | ovs_ct_exit | ovs_exit_net | ops_exit_list.isra.7 | cleanup_net | process_one_work | worker_thread To avoid that warning, invoke the ovs_ct_exit under ovs_lock and add lockdep_ovsl_is_held as optional lockdep expression. Link: https://lore.kernel.org/lkml/000000000000e642a905a0cbee6e@google.com Fixes: 11efd5cb ("openvswitch: Support conntrack zone limit") Cc: Pravin B Shelar <pshelar@ovn.org> Cc: Yi-Hung Wei <yihung.wei@gmail.com> Reported-by: syzbot+7ef50afd3a211f879112@syzkaller.appspotmail.com Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 18 Apr, 2020 8 commits
-
-
Eric Dumazet authored
TCP stack is dumb in how it cooks its output packets. Depending on MAX_HEADER value, we might chose a bad ending point for the headers. If we align the end of TCP headers to cache line boundary, we make sure to always use the smallest number of cache lines, which always help. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Florian Westphal says: ==================== mptcp: fix 'attempt to release socket in state...' splats These two patches fix error handling corner-cases where inet_sock_destruct gets called for a mptcp_sk that is not in TCP_CLOSE state. This results in unwanted error printks from the network stack. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
We need to set sk_state to CLOSED, else we will get following: IPv4: Attempt to release TCP socket in state 3 00000000b95f109e IPv4: Attempt to release TCP socket in state 10 00000000b95f109e First one is from inet_sock_destruct(), second one from mptcp_sk_clone failure handling. Setting sk_state to CLOSED isn't enough, we also need to orphan sk so it has DEAD flag set. Otherwise, a very similar warning is printed from inet_sock_destruct(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
Following snippet (replicated from syzkaller reproducer) generates warning: "IPv4: Attempt to release TCP socket in state 1". int main(void) { struct sockaddr_in sin1 = { .sin_family = 2, .sin_port = 0x4e20, .sin_addr.s_addr = 0x010000e0, }; struct sockaddr_in sin2 = { .sin_family = 2, .sin_addr.s_addr = 0x0100007f, }; struct sockaddr_in sin3 = { .sin_family = 2, .sin_port = 0x4e20, .sin_addr.s_addr = 0x0100007f, }; int r0 = socket(0x2, 0x1, 0x106); int r1 = socket(0x2, 0x1, 0x106); bind(r1, (void *)&sin1, sizeof(sin1)); connect(r1, (void *)&sin2, sizeof(sin2)); listen(r1, 3); return connect(r0, (void *)&sin3, 0x4d); } Reason is that the newly generated mptcp socket is closed via the ulp release of the tcp listener socket when its accept backlog gets purged. To fix this, delay setting the ESTABLISHED state until after userspace calls accept and via mptcp specific destructor. Fixes: 58b09919 ("mptcp: create msk early") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/9Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Commit 9ecc2d86 ("net/mlx4_en: add xdp forwarding and data write support") brought another indirect call in fast path. Use INDIRECT_CALL_2() helper to avoid the cost of the indirect call when/if CONFIG_RETPOLINE=y Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Aring authored
This patch makes it impossible that cmpri or cmpre values are set to the value 16 which is not possible, because these are 4 bit values. We currently run in an overflow when assigning the value 16 to it. According to the standard a value of 16 can be interpreted as a full elided address which isn't possible to set as compression value. A reason why this cannot be set is that the current ipv6 header destination address should never show up inside the segments of the rpl header. In this case we run in a overflow and the address will have no compression at all. Means cmpri or compre is set to 0. As we handle cmpri and cmpre sometimes as unsigned char or 4 bit value inside the rpl header the current behaviour ends in an invalid header format. This patch simple use the best compression method if we ever run into the case that the destination address is showed up inside the rpl segments. We avoid the overflow handling and the rpl header is still valid, even when we have the destination address inside the rpl segments. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Julien Beraud authored
In fine adjustement mode, which is the current default, the sub-second increment register is the number of nanoseconds that will be added to the clock when the accumulator overflows. At each clock cycle, the value of the addend register is added to the accumulator. Currently, we use 20ns = 1e09ns / 50MHz as this value whatever the frequency of the ptp clock actually is. The adjustment is then done on the addend register, only incrementing every X clock cycles X being the ratio between 50MHz and ptp_clock_rate (addend = 2^32 * 50MHz/ptp_clock_rate). This causes the following issues : - In case the frequency of the ptp clock is inferior or equal to 50MHz, the addend value calculation will overflow and the default addend value will be set to 0, causing the clock to not work at all. (For instance, for ptp_clock_rate = 50MHz, addend = 2^32). - The resolution of the timestamping clock is limited to 20ns while it is not needed, thus limiting the accuracy of the timestamping to 20ns. Fix this by setting sub-second increment to 2e09ns / ptp_clock_rate. It will allow to reach the minimum possible frequency for ptp_clk_ref, which is 5MHz for GMII 1000Mps Full-Duplex by setting the sub-second-increment to a higher value. For instance, for 25MHz, it gives ssinc = 80ns and default_addend = 2^31. It will also allow to use a lower value for sub-second-increment, thus improving the timestamping accuracy with frequencies higher than 100MHz, for instance, for 200MHz, ssinc = 10ns and default_addend = 2^31. v1->v2: - Remove modifications to the calculation of default addend, which broke compatibility with clock frequencies for which 2000000000 / ptp_clk_freq is not an integer. - Modify description according to discussions. Signed-off-by: Julien Beraud <julien.beraud@orolia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Julien Beraud authored
There are 2 registers to write to enable a ptp ref clock coming from the fpga. One that enables the usage of the clock from the fpga for emac0 and emac1 as a ptp ref clock, and the other to allow signals from the fpga to reach emac0 and emac1. Currently, if the dwmac-socfpga has phymode set to PHY_INTERFACE_MODE_MII, PHY_INTERFACE_MODE_GMII, or PHY_INTERFACE_MODE_SGMII, both registers will be written and the ptp ref clock will be set as coming from the fpga. Separate the 2 register writes to only enable signals from the fpga to reach emac0 or emac1 when ptp ref clock is not coming from the fpga. Signed-off-by: Julien Beraud <julien.beraud@orolia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-