- 19 Oct, 2018 8 commits
-
-
Alexei Starovoitov authored
Mauricio Vasquez says: ==================== In some applications this is needed have a pool of free elements, for example the list of free L4 ports in a SNAT. None of the current maps allow to do it as it is not possible to get any element without having they key it is associated to, even if it were possible, the lack of locking mecanishms in eBPF would do it almost impossible to be implemented without data races. This patchset implements two new kind of eBPF maps: queue and stack. Those maps provide to eBPF programs the peek, push and pop operations, and for userspace applications a new bpf_map_lookup_and_delete_elem() is added. Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it> v2 -> v3: - Remove "almost dead code" in syscall.c - Remove unnecessary copy_from_user in bpf_map_lookup_and_delete_elem - Rebase v1 -> v2: - Put ARG_PTR_TO_UNINIT_MAP_VALUE logic into a separated patch - Fix missing __this_cpu_dec & preempt_enable calls in kernel/bpf/syscall.c RFC v4 -> v1: - Remove roundup to power of 2 in memory allocation - Remove count and use a free slot to check if queue/stack is empty - Use if + assigment for wrapping indexes - Fix some minor style issues - Squash two patches together RFC v3 -> RFC v4: - Revert renaming of kernel/bpf/stackmap.c - Remove restriction on value size - Remove len arguments from peek/pop helpers - Add new ARG_PTR_TO_UNINIT_MAP_VALUE RFC v2 -> RFC v3: - Return elements by value instead that by reference - Implement queue/stack base on array and head + tail indexes - Rename stack trace related files to avoid confusion and conflicts RFC v1 -> RFC v2: - Create two separate maps instead of single one + flags - Implement bpf_map_lookup_and_delete syscall - Support peek operation - Define replacement policy through flags in the update() method - Add eBPF side tests ==================== Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Mauricio Vasquez B authored
test_maps: Tests that queue/stack maps are behaving correctly even in corner cases test_progs: Tests new ebpf helpers Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Mauricio Vasquez B authored
Sync both files. Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Mauricio Vasquez B authored
The previous patch implemented a bpf queue/stack maps that provided the peek/pop/push functions. There is not a direct relationship between those functions and the current maps syscalls, hence a new MAP_LOOKUP_AND_DELETE_ELEM syscall is added, this is mapped to the pop operation in the queue/stack maps and it is still to implement in other kind of maps. Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Mauricio Vasquez B authored
Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs. These maps support peek, pop and push operations that are exposed to eBPF programs through the new bpf_map[peek/pop/push] helpers. Those operations are exposed to userspace applications through the already existing syscalls in the following way: BPF_MAP_LOOKUP_ELEM -> peek BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop BPF_MAP_UPDATE_ELEM -> push Queue/stack maps are implemented using a buffer, tail and head indexes, hence BPF_F_NO_PREALLOC is not supported. As opposite to other maps, queue and stack do not use RCU for protecting maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE argument that is a pointer to a memory zone where to save the value of a map. Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not be passed as an extra argument. Our main motivation for implementing queue/stack maps was to keep track of a pool of elements, like network ports in a SNAT, however we forsee other use cases, like for exampling saving last N kernel events in a map and then analysing from userspace. Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Mauricio Vasquez B authored
ARG_PTR_TO_UNINIT_MAP_VALUE argument is a pointer to a memory zone used to save the value of a map. Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not be passed as an extra argument. This will be used in the following patch that implements some new helpers that receive a pointer to be filled with a map value. Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Mauricio Vasquez B authored
This commit adds the required logic to allow key being NULL in case the key_size of the map is 0. A new __bpf_copy_key function helper only copies the key from userpsace when key_size != 0, otherwise it enforces that key must be null. Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Mauricio Vasquez B authored
In the following patches queue and stack maps (FIFO and LIFO datastructures) will be implemented. In order to avoid confusion and a possible name clash rename stack_map_ops to stack_trace_map_ops Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 18 Oct, 2018 3 commits
-
-
Jakub Kicinski authored
The nfp driver is currently always JITing the BPF for 4 context/thread mode of the NFP flow processors. Tell this to the disassembler, otherwise some registers may be incorrectly decoded. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Peng Hao authored
FILE pointer variable f is opened but never closed. Signed-off-by: Peng Hao <peng.hao2@zte.com.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Nicolas Dichtel authored
len_diff is signed. Fixes: fa15601a ("bpf: add documentation for eBPF helpers (33-41)") CC: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 17 Oct, 2018 5 commits
-
-
Daniel Borkmann authored
John Fastabend says: ==================== This adds support for the MSG_PEEK flag when redirecting into an ingress psock sk_msg queue. The first patch adds some base support to the helpers, then the feature, and finally we add an option for the test suite to do a duplicate MSG_PEEK call on every recv to test the feature. With duplicate MSG_PEEK call all tests continue to PASS. ==================== Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Add tests that do a MSG_PEEK recv followed by a regular receive to test flag support. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
This adds support for the MSG_PEEK flag when doing redirect to ingress and receiving on the sk_msg psock queue. Previously the flag was being ignored which could confuse applications if they expected the flag to work as normal. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Currently sk_msg_used_element is only called in zerocopy context where cork is not possible and if this case happens we fallback to copy mode. However the helper is more useful if it works in all contexts. This patch resolved the case where if end == head indicating a full or empty ring the helper always reports an empty ring. To fix this add a test for the full ring case to avoid reporting a full ring has 0 elements. This additional functionality will be used in the next patches from recvmsg context where end = head with a full ring is a valid case. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
When converting sockmap to new skmsg generic data structures we missed that the recvmsg handler did not correctly use sg.size and instead was using individual elements length. The result is if a sock is closed with outstanding data we omit the call to sk_mem_uncharge() and can get the warning below. [ 66.728282] WARNING: CPU: 6 PID: 5783 at net/core/stream.c:206 sk_stream_kill_queues+0x1fa/0x210 To fix this correct the redirect handler to xfer the size along with the scatterlist and also decrement the size from the recvmsg handler. Now when a sock is closed the remaining 'size' will be decremented with sk_mem_uncharge(). Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 16 Oct, 2018 24 commits
-
-
Alexei Starovoitov authored
Jakub Kicinski says: ==================== this set adds check to make sure offload behaviour is correct. First when atomic counters are used, we must make sure the map does not already contain data we did not prepare for holding atomics. Second patch double checks vNIC capabilities for program offload in case program is shared by multiple vNICs with different constraints. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jakub Kicinski authored
Program translation stage checks that program can be offloaded to the netdev which was passed during the load (bpf_attr->prog_ifindex). After program sharing was introduced, however, the netdev on which program is loaded can theoretically be different, and therefore we should recheck the program size and max stack size at load time. This was found by code inspection, AFAIK today all vNICs have identical caps. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jakub Kicinski authored
Atomic operations on the NFP are currently always in big endian. The driver keeps track of regions of memory storing atomic values and byte swaps them accordingly. There are corner cases where the map values may be initialized before the driver knows they are used as atomic counters. This can happen either when the datapath is performing the update and the stack contents are unknown or when map is updated before the program which will use it for atomic values is loaded. To avoid situation where user initializes the value to 0 1 2 3 and then after loading a program which uses the word as an atomic counter starts reading 3 2 1 0 - only allow atomic counters to be initialized to endian-neutral values. For updates from the datapath the stack information may not be as precise, so just allow initializing such values to 0. Example code which would break: struct bpf_map_def SEC("maps") rxcnt = { .type = BPF_MAP_TYPE_HASH, .key_size = sizeof(__u32), .value_size = sizeof(__u64), .max_entries = 1, }; int xdp_prog1() { __u64 nonzeroval = 3; __u32 key = 0; __u64 *value; value = bpf_map_lookup_elem(&rxcnt, &key); if (!value) bpf_map_update_elem(&rxcnt, &key, &nonzeroval, BPF_ANY); else __sync_fetch_and_add(value, 1); return XDP_PASS; } $ offload bpftool map dump key: 00 00 00 00 value: 00 00 00 03 00 00 00 00 should be: $ offload bpftool map dump key: 00 00 00 00 value: 03 00 00 00 00 00 00 00 Reported-by: David Beckett <david.beckett@netronome.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrey Ignatov authored
Make global symbols in libbpf DSO hidden by default with -fvisibility=hidden and export symbols that are part of ABI explicitly with __attribute__((visibility("default"))). This is common practice that should prevent from accidentally exporting a symbol, that is not supposed to be a part of ABI what, in turn, improves both libbpf developer- and user-experiences. See [1] for more details. Export control becomes more important since more and more projects use libbpf. The patch doesn't export a bunch of netlink related functions since as agreed in [2] they'll be reworked. That doesn't break bpftool since bpftool links libbpf statically. [1] https://www.akkadia.org/drepper/dsohowto.pdf (2.2 Export Control) [2] https://www.mail-archive.com/netdev@vger.kernel.org/msg251434.htmlSigned-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Daniel Borkmann authored
Andrey reported a build error for the BPF kselftest suite when compiled on a machine which does not have tls related header bits installed natively: test_sockmap.c:120:23: fatal error: linux/tls.h: No such file or directory #include <linux/tls.h> ^ compilation terminated. Fix it by adding the header to the tools include infrastructure and add definitions such as SOL_TLS that could potentially be missing. Fixes: e9dd9047 ("bpf: add tls support for testing in test_sockmap") Reported-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
David S. Miller authored
David Ahern says: ==================== net: Kernel side filtering for route dumps Implement kernel side filtering of route dumps by protocol (e.g., which routing daemon installed the route), route type (e.g., unicast), table id and nexthop device. iproute2 has been doing this filtering in userspace for years; pushing the filters to the kernel side reduces the amount of data the kernel sends and reduces wasted cycles on both sides processing unwanted data. These initial options provide a huge improvement for efficiently examining routes on large scale systems. v2 - better handling of requests for a specific table. Rather than walking the hash of all tables, lookup the specific table and dump it - refactor mr_rtm_dumproute moving the loop over the table into a helper that can be invoked directly - add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure it is returned even when the dump returns nothing ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Unlike IPv6, IPv4 does not have routes marked with RTF_PREFIX_RT. If the flag is set in the dump request, just return. In the process of this change, move the CLONE check to use the new filter flags. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Similar to IPv4, IPv6 fib no longer contains cloned routes. If a user requests a route dump for only cloned entries, no sense walking the FIB and returning everything. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Update the dump request parsing in MPLS for the non-INET case to enable kernel side filtering. If INET is disabled the only filters that make sense for MPLS are protocol and nexthop device. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Update parsing of route dump request to enable kernel side filtering. Allow filtering results by protocol (e.g., which routing daemon installed the route), route type (e.g., unicast), table id and nexthop device. These amount to the low hanging fruit, yet a huge improvement, for dumping routes. ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can be used to look up the device index without taking a reference. From there filter->dev is only used during dump loops with the lock still held. Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results have been filtered should no entries be returned. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Implement kernel side filtering of routes by egress device index and table id. If the table id is given in the filter, lookup table and call mr_table_dump directly for it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Move per-table loops from mr_rtm_dumproute to mr_table_dump and export mr_table_dump for dumps by specific table id. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Implement kernel side filtering of routes by egress device index and protocol. MPLS uses only a single table and route type. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Implement kernel side filtering of routes by table id, egress device index, protocol, and route type. If the table id is given in the filter, lookup the table and call fib6_dump_table directly for it. Move the existing route flags check for prefix only routes to the new filter. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Implement kernel side filtering of routes by table id, egress device index, protocol and route type. If the table id is given in the filter, lookup the table and call fib_table_dump directly for it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Add struct fib_dump_filter for options on limiting which routes are returned in a dump request. The current list is table id, protocol, route type, rtm_flags and nexthop device index. struct net is needed to lookup the net_device from the index. Declare the filter for each route dump handler and plumb the new arguments from dump handlers to ip_valid_fib_dump_req. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
With dump filtering we need a way to ensure the NLM_F_DUMP_FILTERED flag is set on a message back to the user if the data returned is influenced by some input attributes. Normally this can be done as messages are added to the skb, but if the filter results in no data being returned, the user could be confused as to why. This patch adds answer_flags to the netlink_callback allowing dump handlers to set the NLM_F_DUMP_FILTERED at a minimum in the NLMSG_DONE message ensuring the flag gets back to the user. The netlink_callback space is initialized to 0 via a memset in __netlink_dump_start, so init of the new answer_flags is covered. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller authored
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-10-16 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable sk_msg BPF integration for the latter, from Daniel and John. 2) Enable BPF syscall side to indicate for maps that they do not support a map lookup operation as opposed to just missing key, from Prashant. 3) Add bpftool map create command which after map creation pins the map into bpf fs for further processing, from Jakub. 4) Add bpftool support for attaching programs to maps allowing sock_map and sock_hash to be used from bpftool, from John. 5) Improve syscall BPF map update/delete path for map-in-map types to wait a RCU grace period for pending references to complete, from Daniel. 6) Couple of follow-up fixes for the BPF socket lookup to get it enabled also when IPv6 is compiled as a module, from Joe. 7) Fix a generic-XDP bug to handle the case when the Ethernet header was mangled and thus update skb's protocol and data, from Jesper. 8) Add a missing BTF header length check between header copies from user space, from Wenwen. 9) Minor fixups in libbpf to use __u32 instead u32 types and include proper perf_event.h uapi header instead of perf internal one, from Yonghong. 10) Allow to pass user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS to bpftool's build, from Jiri. 11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install with_addr.sh script from flow dissector selftest, from Anders. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Heiner Kallweit authored
After commit 9f2959b6 ("net: phy: improve handling delayed work") the sync parameter isn't needed any longer in phy_start_aneg_priv(). This allows to merge phy_start_aneg() and phy_start_aneg_priv(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Haiyang Zhang authored
The VF device's serial number is saved as a string in PCI slot's kobj name, not the slot->number. This patch corrects the netvsc driver, so the VF device can be successfully paired with synthetic NIC. Fixes: 00d7ddba ("hv_netvsc: pair VF based on serial number") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Eric Dumazet says: ==================== tcp: second round for EDT conversion First round of EDT patches left TCP stack in a non optimal state. - High speed flows suffered from loss of performance, addressed by the first patch of this series. - Second patch brings pacing to the current state of networking, since we now reach ~100 Gbit on a single TCP flow. - Third patch implements a mitigation for scheduling delays, like the one we did in sch_fq in the past. - Fourth patch removes one special case in sch_fq for ACK packets. - Fifth patch removes a serious perfomance cost for TCP internal pacing. We should setup the high resolution timer only if really needed. - Sixth patch fixes a typo in BBR. - Last patch is one minor change in cdg congestion control. Neal Cardwell also has a patch series fixing BBR after EDT adoption. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
We store in tcp socket a cache of most recent high resolution clock, there is no need to call local_clock() again, since this cache is good enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Neal Cardwell authored
There was a typo in this parameter name. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
When TCP implements its own pacing (when no fq packet scheduler is used), it is arming high resolution timer after a packet is sent. But in many cases (like TCP_RR kind of workloads), this high resolution timer expires before the application attempts to write the following packet. This overhead also happens when the flow is ACK clocked and cwnd limited instead of being limited by the pacing rate. This leads to extra overhead (high number of IRQ) Now tcp_wstamp_ns is reserved for the pacing timer only (after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"), we can setup the timer only when a packet is about to be sent, and if tcp_wstamp_ns is in the future. This leads to a ~10% performance increase in TCP_RR workloads. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-