- 28 Jun, 2021 4 commits
-
-
Jakub Kicinski authored
xdp_rxq_info_unreg() implicitly calls xdp_rxq_info_unreg_mem_model(). This may well be confusing to the driver authors, and lead to double free if they call xdp_rxq_info_unreg_mem_model() before xdp_rxq_info_unreg() (when mem model type == MEM_TYPE_PAGE_POOL). In fact error path of mvpp2_rxq_init() seems to currently do exactly that. The double free will result in refcount underflow in page_pool_destroy(). Make the interface a little more programmer friendly by clearing type and id so that xdp_rxq_info_unreg_mem_model() can be called multiple times. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210625221612.2637086-1-kuba@kernel.org
-
Rustam Kovhaev authored
kmemleak scans struct page, but it does not scan the page content. If we allocate some memory with kmalloc(), then allocate page with alloc_page(), and if we put kmalloc pointer somewhere inside that page, kmemleak will report kmalloc pointer as a false positive. We can instruct kmemleak to scan the memory area by calling kmemleak_alloc() and kmemleak_free(), but part of struct bpf_ringbuf is mmaped to user space, and if struct bpf_ringbuf changes we would have to revisit and review size argument in kmemleak_alloc(), because we do not want kmemleak to scan the user space memory. Let's simplify things and use kmemleak_not_leak() here. For posterity, also adding additional prior analysis from Andrii: I think either kmemleak or syzbot are misreporting this. I've added a bunch of printks around all allocations performed by BPF ringbuf. [...] On repro side I get these two warnings: [vmuser@archvm bpf]$ sudo ./repro BUG: memory leak unreferenced object 0xffff88810d538c00 (size 64): comm "repro", pid 2140, jiffies 4294692933 (age 14.540s) hex dump (first 32 bytes): 00 af 19 04 00 ea ff ff c0 ae 19 04 00 ea ff ff ................ 80 ae 19 04 00 ea ff ff c0 29 2e 04 00 ea ff ff .........)...... backtrace: [<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0 [<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218 [<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90 [<00000000f601d565>] do_syscall_64+0x2d/0x40 [<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae BUG: memory leak unreferenced object 0xffff88810d538c80 (size 64): comm "repro", pid 2143, jiffies 4294699025 (age 8.448s) hex dump (first 32 bytes): 80 aa 19 04 00 ea ff ff 00 ab 19 04 00 ea ff ff ................ c0 ab 19 04 00 ea ff ff 80 44 28 04 00 ea ff ff .........D(..... backtrace: [<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0 [<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218 [<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90 [<00000000f601d565>] do_syscall_64+0x2d/0x40 [<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae Note that both reported leaks (ffff88810d538c80 and ffff88810d538c00) correspond to pages array bpf_ringbuf is allocating and tracking properly internally. Note also that syzbot repro doesn't close FD of created BPF ringbufs, and even when ./repro itself exits with error, there are still two forked processes hanging around in my system. So clearly ringbuf maps are alive at that point. So reporting any memory leak looks weird at that point, because that memory is being used by active referenced BPF ringbuf. It's also a question why repro doesn't clean up its forks. But if I do a `pkill repro`, I do see that all the allocated memory is /properly/ cleaned up [and the] "leaks" are deallocated properly. BTW, if I add close() right after bpf() syscall in syzbot repro, I see that everything is immediately deallocated, like designed. And no memory leak is reported. So I don't think the problem is anywhere in bpf_ringbuf code, rather in the leak detection and/or repro itself. Reported-by: syzbot+5d895828587f49e7fe9b@syzkaller.appspotmail.com Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com> [ Daniel: also included analysis from Andrii to the commit log ] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: syzbot+5d895828587f49e7fe9b@syzkaller.appspotmail.com Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/CAEf4BzYk+dqs+jwu6VKXP-RttcTEGFe+ySTGWT9CRNkagDiJVA@mail.gmail.com Link: https://lore.kernel.org/lkml/YNTAqiE7CWJhOK2M@nuc10 Link: https://lore.kernel.org/lkml/20210615101515.GC26027@arm.com Link: https://syzkaller.appspot.com/bug?extid=5d895828587f49e7fe9b Link: https://lore.kernel.org/bpf/20210626181156.1873604-1-rkovhaev@gmail.com
-
Namhyung Kim authored
Allow the helper to be called from tracing programs. This is needed to handle cgroup hiererachies in the program. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210627153627.824198-1-namhyung@kernel.org
-
Ravi Bangoria authored
Commit 4c5de127 ("bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.") is emitting a couple of instructions before the actual load. Consider those additional instructions while calculating extable offset. Fixes: 4c5de127 ("bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.") Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210622110026.1157847-1-ravi.bangoria@linux.ibm.com
-
- 25 Jun, 2021 1 commit
-
-
Gary Lin authored
Per the kmsg document [0], if we don't specify the log level with a prefix "<N>" in the message string, the default log level will be applied to the message. Since the default level could be warning(4), this would make the log utility such as journalctl treat the message, "Started bpfilter", as a warning. To avoid confusion, this commit adds the prefix "<5>" to make the message always a notice. [0] https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg Fixes: 36c4357c ("net: bpfilter: print umh messages to /dev/kmsg") Reported-by: Martin Loviska <mloviska@suse.com> Signed-off-by: Gary Lin <glin@suse.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Dmitrii Banshchikov <me@ubique.spb.ru> Link: https://lore.kernel.org/bpf/20210623040918.8683-1-glin@suse.com
-
- 24 Jun, 2021 24 commits
-
-
Toke Høiland-Jørgensen authored
The cpsw driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Cc: linux-omap@vger.kernel.org Link: https://lore.kernel.org/bpf/20210624160609.292325-20-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The stmmac driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com> Cc: Alexandre Torgue <alexandre.torgue@foss.st.com> Cc: Jose Abreu <joabreu@synopsys.com> Link: https://lore.kernel.org/bpf/20210624160609.292325-19-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The netsec driver has a rcu_read_lock()/rcu_read_unlock() pair around the full RX loop, covering everything up to and including xdp_do_flush(). This is actually the correct behaviour, but because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), it is also technically redundant. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep the rcu_read_lock() around anymore, so let's just remove it. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jassi Brar <jaswinder.singh@linaro.org> Link: https://lore.kernel.org/bpf/20210624160609.292325-18-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The sfc driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Edward Cree <ecree.xilinx@gmail.com> Cc: Martin Habets <habetsm.xilinx@gmail.com> Link: https://lore.kernel.org/bpf/20210624160609.292325-17-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The qede driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Ariel Elior <aelior@marvell.com> Cc: gr-everest-linux-l2@marvell.com Link: https://lore.kernel.org/bpf/20210624160609.292325-16-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The nfp driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. While this is not actually an issue for the nfp driver because it doesn't support XDP_REDIRECT (and thus doesn't call xdp_do_flush()), the rcu_read_lock() is still unneeded. And With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Simon Horman <simon.horman@netronome.com> Cc: oss-drivers@netronome.com Link: https://lore.kernel.org/bpf/20210624160609.292325-15-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The mlx4 driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Also switch the RCU dereferences in the driver loop itself to the _bh variants. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/bpf/20210624160609.292325-14-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The mvneta and mvpp2 drivers have rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Marcin Wojtas <mw@semihalf.com> Link: https://lore.kernel.org/bpf/20210624160609.292325-13-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The Intel drivers all have rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> # i40e Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: intel-wired-lan@lists.osuosl.org Link: https://lore.kernel.org/bpf/20210624160609.292325-12-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The dpaa and dpaa2 drivers have rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Camelia Groza <camelia.groza@nxp.com> Cc: Ioana Radulescu <ruxandra.radulescu@nxp.com> Cc: Madalin Bucur <madalin.bucur@nxp.com> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Link: https://lore.kernel.org/bpf/20210624160609.292325-11-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The thunderx driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Sunil Goutham <sgoutham@marvell.com> Cc: linux-arm-kernel@lists.infradead.org Link: https://lore.kernel.org/bpf/20210624160609.292325-10-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The bnxt driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/bpf/20210624160609.292325-9-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The ena driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Saeed Bishara <saeedb@amazon.com> Cc: Guy Tzalik <gtzalik@amazon.com> Link: https://lore.kernel.org/bpf/20210624160609.292325-8-toke@redhat.com
-
Toke Høiland-Jørgensen authored
The rcu_read_lock() call in cls_bpf and act_bpf are redundant: on the TX side, there's already a call to rcu_read_lock_bh() in __dev_queue_xmit(), and on RX there's a covering rcu_read_lock() in netif_receive_skb{,_list}_internal(). With the previous patches we also amended the lockdep checks in the map code to not require any particular RCU flavour, so we can just get rid of the rcu_read_lock()s. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210624160609.292325-7-toke@redhat.com
-
Toke Høiland-Jørgensen authored
XDP_REDIRECT works by a three-step process: the bpf_redirect() and bpf_redirect_map() helpers will lookup the target of the redirect and store it (along with some other metadata) in a per-CPU struct bpf_redirect_info. Next, when the program returns the XDP_REDIRECT return code, the driver will call xdp_do_redirect() which will use the information thus stored to actually enqueue the frame into a bulk queue structure (that differs slightly by map type, but shares the same principle). Finally, before exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will flush all the different bulk queues, thus completing the redirect. Pointers to the map entries will be kept around for this whole sequence of steps, protected by RCU. However, there is no top-level rcu_read_lock() in the core code; instead drivers add their own rcu_read_lock() around the XDP portions of the code, but somewhat inconsistently as Martin discovered[0]. However, things still work because everything happens inside a single NAPI poll sequence, which means it's between a pair of calls to local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could document this intention by using rcu_dereference_check() with rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and lockdep to verify that everything is done correctly. This patch does just that: we add an __rcu annotation to the map entry pointers and remove the various comments explaining the NAPI poll assurance strewn through devmap.c in favour of a longer explanation in filter.c. The goal is to have one coherent documentation of the entire flow, and rely on the RCU annotations as a "standard" way of communicating the flow in the map code (which can additionally be understood by sparse and lockdep). The RCU annotation replacements result in a fairly straight-forward replacement where READ_ONCE() becomes rcu_dereference_check(), WRITE_ONCE() becomes rcu_assign_pointer() and xchg() and cmpxchg() gets wrapped in the proper constructs to cast the pointer back and forth between __rcu and __kernel address space (for the benefit of sparse). The one complication is that xskmap has a few constructions where double-pointers are passed back and forth; these simply all gain __rcu annotations, and only the final reference/dereference to the inner-most pointer gets changed. With this, everything can be run through sparse without eliciting complaints, and lockdep can verify correctness even without the use of rcu_read_lock() in the drivers. Subsequent patches will clean these up from the drivers. [0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/ [1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com
-
Toke Høiland-Jørgensen authored
XDP programs are called from a NAPI poll context, which means the RCU reference liveness is ensured by local_bh_disable(). Add rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so lockdep understands that the dereferences are safe from inside *either* an rcu_read_lock() section *or* a local_bh_disable() section. While both bh_disabled and rcu_read_lock() provide RCU protection, they are semantically distinct, so we need both conditions to prevent lockdep complaints. This change is done in preparation for removing the redundant rcu_read_lock()s from drivers. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210624160609.292325-5-toke@redhat.com
-
Toke Høiland-Jørgensen authored
This commit gives an example of non-obvious RCU reader/updater pairing in the guise of the XDP feature in networking, which calls BPF programs from network-driver NAPI (softirq) context. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210624160609.292325-4-toke@redhat.com
-
Paul E. McKenney authored
This commit clarifies which primitives readers can use given that the corresponding updaters have made a specific choice. This commit also adds this information for the various RCU Tasks flavors. While in the area, it removes a paragraph that no longer applies in any straightforward manner. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210624160609.292325-3-toke@redhat.com
-
Paul E. McKenney authored
The xchg() and cmpxchg() functions are sometimes used to carry out RCU updates. Unfortunately, this can result in sparse warnings for both the old-value and new-value arguments, as well as for the return value. The arguments can be dealt with using RCU_INITIALIZER(): old_p = xchg(&p, RCU_INITIALIZER(new_p)); But a sparse warning still remains due to assigning the __rcu pointer returned from xchg to the (most likely) non-__rcu pointer old_p. This commit therefore provides an unrcu_pointer() macro that strips the __rcu. This macro can be used as follows: old_p = unrcu_pointer(xchg(&p, RCU_INITIALIZER(new_p))); Reported-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210624160609.292325-2-toke@redhat.com
-
Maciej Żenczykowski authored
Since we no longer modify gso_size, it is now theoretically safe to not set SKB_GSO_DODGY and reset gso_segs to zero. This also means the skb_is_gso_tcp() check should no longer be necessary. Unfortunately we cannot remove the skb_{decrease,increase}_gso_size() helpers, as they are still used elsewhere: bpf_skb_net_grow() without BPF_F_ADJ_ROOM_FIXED_GSO bpf_skb_net_shrink() without BPF_F_ADJ_ROOM_FIXED_GSO net/core/lwt_bpf.c's handle_gso_type() Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Dongseok Yi <dseok.yi@samsung.com> Cc: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/bpf/20210617000953.2787453-3-zenczykowski@gmail.com
-
Maciej Żenczykowski authored
This is technically a backwards incompatible change in behaviour, but I'm going to argue that it is very unlikely to break things, and likely to fix *far* more then it breaks. In no particular order, various reasons follow: (a) I've long had a bug assigned to myself to debug a super rare kernel crash on Android Pixel phones which can (per stacktrace) be traced back to BPF clat IPv6 to IPv4 protocol conversion causing some sort of ugly failure much later on during transmit deep in the GSO engine, AFAICT precisely because of this change to gso_size, though I've never been able to manually reproduce it. I believe it may be related to the particular network offload support of attached USB ethernet dongle being used for tethering off of an IPv6-only cellular connection. The reason might be we end up with more segments than max permitted, or with a GSO packet with only one segment... (either way we break some assumption and hit a BUG_ON) (b) There is no check that the gso_size is > 20 when reducing it by 20, so we might end up with a negative (or underflowing) gso_size or a gso_size of 0. This can't possibly be good. Indeed this is probably somehow exploitable (or at least can result in a kernel crash) by delivering crafted packets and perhaps triggering an infinite loop or a divide by zero... As a reminder: gso_size (MSS) is related to MTU, but not directly derived from it: gso_size/MSS may be significantly smaller then one would get by deriving from local MTU. And on some NICs (which do loose MTU checking on receive, it may even potentially be larger, for example my work pc with 1500 MTU can receive 1520 byte frames [and sometimes does due to bugs in a vendor plat46 implementation]). Indeed even just going from 21 to 1 is potentially problematic because it increases the number of segments by a factor of 21 (think DoS, or some other crash due to too many segments). (c) It's always safe to not increase the gso_size, because it doesn't result in the max packet size increasing. So the skb_increase_gso_size() call was always unnecessary for correctness (and outright undesirable, see later). As such the only part which is potentially dangerous (ie. could cause backwards compatibility issues) is the removal of the skb_decrease_gso_size() call. (d) If the packets are ultimately destined to the local device, then there is absolutely no benefit to playing around with gso_size. It only matters if the packets will egress the device. ie. we're either forwarding, or transmitting from the device. (e) This logic only triggers for packets which are GSO. It does not trigger for skbs which are not GSO. It will not convert a non-GSO MTU sized packet into a GSO packet (and you don't even know what the MTU is, so you can't even fix it). As such your transmit path must *already* be able to handle an MTU 20 bytes larger then your receive path (for IPv4 to IPv6 translation) - and indeed 28 bytes larger due to IPv4 fragments. Thus removing the skb_decrease_gso_size() call doesn't actually increase the size of the packets your transmit side must be able to handle. ie. to handle non-GSO max-MTU packets, the IPv4/IPv6 device/ route MTUs must already be set correctly. Since for example with an IPv4 egress MTU of 1500, IPv4 to IPv6 translation will already build 1520 byte IPv6 frames, so you need a 1520 byte device MTU. This means if your IPv6 device's egress MTU is 1280, your IPv4 route must be 1260 (and actually 1252, because of the need to handle fragments). This is to handle normal non-GSO packets. Thus the reduction is simply not needed for GSO packets, because when they're correctly built, they will already be the right size. (f) TSO/GSO should be able to exactly undo GRO: the number of packets (TCP segments) should not be modified, so that TCP's MSS counting works correctly (this matters for congestion control). If protocol conversion changes the gso_size, then the number of TCP segments may increase or decrease. Packet loss after protocol conversion can result in partial loss of MSS segments that the sender sent. How's the sending TCP stack going to react to receiving ACKs/SACKs in the middle of the segments it sent? (g) skb_{decrease,increase}_gso_size() are already no-ops for GSO_BY_FRAGS case (besides triggering WARN_ON_ONCE). This means you already cannot guarantee that gso_size (and thus resulting packet MTU) is changed. ie. you must assume it won't be changed. (h) changing gso_size is outright buggy for UDP GSO packets, where framing matters (I believe that's also the case for SCTP, but it's already excluded by [g]). So the only remaining case is TCP, which also doesn't want it (see [f]). (i) see also the reasoning on the previous attempt at fixing this (commit fa7b83bf) which shows that the current behaviour causes TCP packet loss: In the forwarding path GRO -> BPF 6 to 4 -> GSO for TCP traffic, the coalesced packet payload can be > MSS, but < MSS + 20. bpf_skb_proto_6_to_4() will upgrade the MSS and it can be > the payload length. After then tcp_gso_segment checks for the payload length if it is <= MSS. The condition is causing the packet to be dropped. tcp_gso_segment(): [...] mss = skb_shinfo(skb)->gso_size; if (unlikely(skb->len <= mss)) goto out; [...] Thus changing the gso_size is simply a very bad idea. Increasing is unnecessary and buggy, and decreasing can go negative. Fixes: 6578171a ("bpf: add bpf_skb_change_proto helper") Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Dongseok Yi <dseok.yi@samsung.com> Cc: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/bpf/CANP3RGfjLikQ6dg=YpBU0OeHvyv7JOki7CyOUS9modaXAi-9vQ@mail.gmail.com Link: https://lore.kernel.org/bpf/20210617000953.2787453-2-zenczykowski@gmail.com
-
Maciej Żenczykowski authored
This reverts commit fa7b83bf. See the followup commit for the reasoning why I believe the appropriate approach is to simply make this change without a flag, but it can basically be summarized as using this helper without the flag is bug-prone or outright buggy, and thus the default should be this new behaviour. As this commit has only made it into net-next/master, but not into any real release, such a backwards incompatible change is still ok. Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Dongseok Yi <dseok.yi@samsung.com> Cc: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/bpf/20210617000953.2787453-1-zenczykowski@gmail.com
-
Sean Young authored
The syscall bpf(BPF_PROG_QUERY, &attr) should use the prog_cnt field to see how many entries user space provided and return ENOSPC if there are more programs than that. Before this patch, this is not checked and ENOSPC is never returned. Note that one lirc device is limited to 64 bpf programs, and user space I'm aware of -- ir-keytable -- always gives enough space for 64 entries already. However, we should not copy program ids than are requested. Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210623213754.632-1-sean@mess.org
-
Jiri Olsa authored
Removing unused cnt increase from EMIT macro together with cnt declarations. This was introduced in commit [1] to ensure proper code generation. But that code was removed in commit [2] and this extra code was left in. [1] b52f00e6 ("x86: bpf_jit: implement bpf_tail_call() helper") [2] ebf7d1f5 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210623112504.709856-1-jolsa@kernel.org
-
- 23 Jun, 2021 1 commit
-
-
Ilya Maximets authored
Examples in this document use all kinds of indentation from 3 to 5 spaces and even mixed with tabs. Making them all even and equal to 4 spaces. Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20210622185647.3705104-1-i.maximets@ovn.org
-
- 22 Jun, 2021 2 commits
-
-
Kumar Kartikeya Dwivedi authored
Netlink helpers I added in 8bbb77b7 ("libbpf: Add various netlink helpers") used char * casts everywhere, and there were a few more that existed from before. Convert all of them to void * cast, as it is treated equivalently by clang/gcc for the purposes of pointer arithmetic and to follow the convention elsewhere in the kernel/libbpf. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210619041454.417577-2-memxor@gmail.com
-
Kumar Kartikeya Dwivedi authored
Coverity complains about OOB writes to nlmsghdr. There is no OOB as we write to the trailing buffer, but static analyzers and compilers may rightfully be confused as the nlmsghdr pointer has subobject provenance (and hence subobject bounds). Fix this by using an explicit request structure containing the nlmsghdr, struct tcmsg/ifinfomsg, and attribute buffer. Also switch nh_tail (renamed to req_tail) to cast req * to char * so that it can be understood as arithmetic on pointer to the representation array (hence having same bound as request structure), which should further appease analyzers. As a bonus, callers don't have to pass sizeof(req) all the time now, as size is implicitly obtained using the pointer. While at it, also reduce the size of attribute buffer to 128 bytes (132 for ifinfomsg using functions due to the padding). Summary of problem: Even though C standard allows interconvertibility of pointer to first member and pointer to struct, for the purposes of alias analysis it would still consider the first as having pointer value "pointer to T" where T is type of first member hence having subobject bounds, allowing analyzers within reason to complain when object is accessed beyond the size of pointed to object. The only exception to this rule may be when a char * is formed to a member subobject. It is not possible for the compiler to be able to tell the intent of the programmer that it is a pointer to member object or the underlying representation array of the containing object, so such diagnosis is suppressed. Fixes: 715c5ce4 ("libbpf: Add low level TC-BPF management API") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210619041454.417577-1-memxor@gmail.com
-
- 21 Jun, 2021 1 commit
-
-
Jonathan Edwards authored
eBPF has been backported for RHEL 7 w/ kernel 3.10-940+ [0]. However only the following program types are supported [1]: BPF_PROG_TYPE_KPROBE BPF_PROG_TYPE_TRACEPOINT BPF_PROG_TYPE_PERF_EVENT For libbpf this causes an EINVAL return during the bpf_object__probe_loading call which only checks to see if programs of type BPF_PROG_TYPE_SOCKET_FILTER can load. The following will try BPF_PROG_TYPE_TRACEPOINT as a fallback attempt before erroring out. BPF_PROG_TYPE_KPROBE was not a good candidate because on some kernels it requires knowledge of the LINUX_VERSION_CODE. [0] https://www.redhat.com/en/blog/introduction-ebpf-red-hat-enterprise-linux-7 [1] https://access.redhat.com/articles/3550581Signed-off-by: Jonathan Edwards <jonathan.edwards@165gc.onmicrosoft.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210619151007.GA6963@165gc.onmicrosoft.com
-
- 18 Jun, 2021 4 commits
-
-
Grant Seltzer authored
This patch is meant to start the initiative to document libbpf. It includes .rst files which are text documentation describing building, API naming convention, as well as an index to generated API documentation. In this approach the generated API documentation is enabled by the kernels existing kernel documentation system which uses sphinx. The resulting docs would then be synced to kernel.org/doc You can test this by running `make htmldocs` and serving the html in Documentation/output. Since libbpf does not yet have comments in kernel doc format, see kernel.org/doc/html/latest/doc-guide/kernel-doc.html for an example so you can test this. The advantage of this approach is to use the existing sphinx infrastructure that the kernel has, and have libbpf docs in the same place as everything else. The current plan is to have the libbpf mirror sync the generated docs and version them based on the libbpf releases which are cut on github. This patch includes the addition of libbpf_api.rst which pulls comment documentation from header files in libbpf under tools/lib/bpf/. The comment docs would be of the standard kernel doc format. Signed-off-by: Grant Seltzer <grantseltzer@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210618140459.9887-2-grantseltzer@gmail.com
-
Wang Hai authored
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. If bpf_map_update_elem() failed, main() should return a negative error. Fixes: 832622e6 ("xdp: sample program for new bpf_redirect helper") Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210616042534.315097-1-wanghai38@huawei.com
-
Wang Hai authored
A Segmentation fault error is caused when the following command is executed. $ sudo ./samples/bpf/xdp_redirect lo Segmentation fault This command is missing a device <IFNAME|IFINDEX> as an argument, resulting in out-of-bounds access from argv. If the number of devices for the xdp_redirect parameter is not 2, we should report an error and exit. Fixes: 24251c26 ("samples/bpf: add option for native and skb mode for redirect apps") Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210616042324.314832-1-wanghai38@huawei.com
-
Andrii Nakryiko authored
Seems like 4d1b6298 ("selftests/bpf: Convert few tests to light skeleton.") and 704e2beb ("selftests/bpf: Test ringbuf mmap read-only and read-write restrictions") were done independently on bpf and bpf-next trees and are in conflict with each other, despite a clean merge. Fix fetching of ringbuf's map_fd to use light skeleton properly. Fixes: 704e2beb ("selftests/bpf: Test ringbuf mmap read-only and read-write restrictions") Fixes: 4d1b6298 ("selftests/bpf: Convert few tests to light skeleton.") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210618002824.2081922-1-andrii@kernel.org
-
- 17 Jun, 2021 3 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queueDavid S. Miller authored
Tony Nguyen says: ==================== 100GbE Intel Wired LAN Driver Updates 2021-06-17 This series contains updates to ice driver only. Jake corrects a couple of entries in the PTYPE table to properly reflect the datasheet and removes unneeded NULL checks for some PTP calls. Paul reduces the scope of variables and removes the use of a local variable. Shaokun Zhang removes a duplicate function declaration. Lorenzo Bianconi fixes a compilation warning if PTP_1588_CLOCK is disabled. Colin Ian King changes a for loop to remove an unneeded 'continue'. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Guangbin Huang says: ==================== net: hdlc_ppp: clean up some code style issues This patchset clean up some code style issues. --- Change Log: V1 -> V2: 1. remove patch "net: hdlc_ppp: fix the comments style issue" and patch "net: hdlc_ppp: remove redundant spaces" from this patchset. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Peng Li authored
Add space required after that ','. Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-