- 16 Nov, 2021 31 commits
-
-
Eric Dumazet authored
This is distracting really, let's make this simpler, because many callers had to take care of this by themselves, even if on x86 this adds more code than really needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
net->core.sock_inuse is a per cpu variable (int), while net->core.prot_inuse is another per cpu variable of 64 integers. per cpu allocator tend to place them in very different places. Grouping them together makes sense, since it makes updates potentially faster, if hitting the same cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
MPTCP hard codes it, let us instead provide this helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
sock_prot_inuse_add() is very small, we can inline it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Eric Dumazet says: ==================== gro: get out of core files Move GRO related content into net/core/gro.c and include/net/gro.h. This reduces GRO scope to where it is really needed, and shrinks too big files (include/linux/netdevice.h and net/core/dev.c) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Move gro code and data from net/core/dev.c to net/core/gro.c to ease maintenance. gro_normal_list() and gro_normal_one() are inlined because they are called from both files. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
net/core/gro.c will contain all core gro functions, to shrink net/core/skbuff.c and net/core/dev.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
This helper is used once, no need to keep it in fat net/core/skbuff.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
include/linux/netdevice.h became too big, move gro stuff into include/net/gro.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Eric Dumazet says: ==================== tcp: optimizations for linux-5.17 Mostly small improvements in this series. The notable change is in "defer skb freeing after socket lock is released" in recvmsg() (and RX zerocopy) The idea is to try to let skb freeing to BH handler, whenever possible, or at least perform the freeing outside of the socket lock section, for much improved performance. This idea can probably be extended to other protocols. Tests on a 100Gbit NIC Max throughput for one TCP_STREAM flow, over 10 runs. MTU : 1500 (1428 bytes of TCP payload per MSS) Before: 55 Gbit After: 66 Gbit MTU : 4096+ (4096 bytes of TCP payload, plus TCP/IPv6 headers) Before: 82 Gbit After: 95 Gbit ==================== Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
sk_rx_dst/sk_rx_dst_ifindex/sk_rx_dst_cookie are read in early demux, and currently spans two cache lines. Moving them close to sk_refcnt makes more sense, as only one cache line is needed. New layout for this hot cache line is : struct sock { struct sock_common __sk_common; /* 0 0x88 */ /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ struct dst_entry * sk_rx_dst; /* 0x88 0x8 */ int sk_rx_dst_ifindex; /* 0x90 0x4 */ u32 sk_rx_dst_cookie; /* 0x94 0x4 */ socket_lock_t sk_lock; /* 0x98 0x20 */ atomic_t sk_drops; /* 0xb8 0x4 */ int sk_rcvlowat; /* 0xbc 0x4 */ /* --- cacheline 3 boundary (192 bytes) --- */ Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Under pressure, tcp recvmsg() has logic to process the socket backlog, but calls tcp_cleanup_rbuf() right before. Avoiding sending ACK right before processing new segments makes a lot of sense, as this decrease the number of ACK packets, with no impact on effective ACK clocking. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Testing timeo before sk_err/sk_state/sk_shutdown makes more sense. Modern applications use non-blocking IO, while a socket is terminated only once during its life time. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
tcp recvmsg() (or rx zerocopy) spends a fair amount of time freeing skbs after their payload has been consumed. A typical ~64KB GRO packet has to release ~45 page references, eventually going to page allocator for each of them. Currently, this freeing is performed while socket lock is held, meaning that there is a high chance that BH handler has to queue incoming packets to tcp socket backlog. This can cause additional latencies, because the user thread has to process the backlog at release_sock() time, and while doing so, additional frames can be added by BH handler. This patch adds logic to defer these frees after socket lock is released, or directly from BH handler if possible. Being able to free these skbs from BH handler helps a lot, because this avoids the usual alloc/free assymetry, when BH handler and user thread do not run on same cpu or NUMA node. One cpu can now be fully utilized for the kernel->user copy, and another cpu is handling BH processing and skb/page allocs/frees (assuming RFS is not forcing use of a single CPU) Tested: 100Gbit NIC Max throughput for one TCP_STREAM flow, over 10 runs MTU : 1500 Before: 55 Gbit After: 66 Gbit MTU : 4096+(headers) Before: 82 Gbit After: 95 Gbit Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
TCP uses sk_eat_skb() when skbs can be removed from receive queue. However, the call to skb_orphan() from __kfree_skb() incurs an indirect call so sock_rfee(), which is more expensive than a direct call, especially for CONFIG_RETPOLINE=y. Add tcp_eat_recv_skb() function to make the call before __kfree_skb(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Use some unlikely() hints in the fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
tcp_poll() and tcp_ioctl() are reading tp->urg_data without socket lock owned. Also, it is faster to first check tp->urg_data in tcp_poll(), then tp->urg_seq == tp->copied_seq, because tp->urg_seq is located in a different/cold cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
tcp_segs_in() can be called from BH, while socket spinlock is held but socket owned by user, eventually reading these fields from tcp_get_info() Found by code inspection, no need to backport this patch to older kernels. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Use INDIRECT_CALL_INET() to avoid an indirect call when/if CONFIG_RETPOLINE=y Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
When reading large chunks of data, incoming packets might be added to the backlog from BH. tcp recvmsg() detects the backlog queue is not empty, and uses a release_sock()/lock_sock() pair to process this backlog. We now have __sk_flush_backlog() to perform this a bit faster. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
tcp_memory_allocated and tcp_sockets_allocated often share a common cache line, source of false sharing. Also take care of udp_memory_allocated and mptcp_sockets_allocated. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
(struct proto)->sk_forward_alloc is currently only used by MPTCP. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Move sk_bind_phc next to sk_peer_lock to fill a hole. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
gso_size can be moved after tclass, to use an existing hole. (8 bytes saved on 64bit arches) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Instead of using a full netdev_features_t, we can use a single bit, as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from sk->sk_route_cap. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
We were only using one bit, and we can replace it by sk_is_tcp() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Move sk_is_tcp() to include/net/sock.h and use it where we can. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
For TCP flows, inet6_sk(sk)->saddr has the same value than sk->sk_v6_rcv_saddr. Using sk->sk_v6_rcv_saddr increases data locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
For some reason, I forgot to change __tcp_v6_send_check() at the same time I removed (ip_summed == CHECKSUM_PARTIAL) check in __tcp_v4_send_check() Fixes: 98be9b12 ("tcp: remove dead code after CHECKSUM_PARTIAL adoption") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
If packet is going to be coalesced, sk_sndbuf/sk_rcvbuf values are not used. Defer their access to the point we need them. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sean Anderson authored
There were several cases where validate() would return bogus supported modes with unusual combinations of interfaces and capabilities. For example, if state->interface was 10GBASER and the macb had HIGH_SPEED and PCS but not GIGABIT MODE, then 10/100 modes would be set anyway. In another case, SGMII could be enabled even if the mac was not a GEM (despite this being checked for later on in mac_config()). These inconsistencies make it difficult to refactor this function cleanly. There is still the open question of what exactly the requirements for SGMII and 10GBASER are, and what SGMII actually supports. If someone from Cadence (or anyone else with access to the GEM/MACB datasheet) could comment on this, it would be greatly appreciated. In particular, what is supported by Cadence vs. vendor extension/limitation? To address this, the current logic is split into three parts. First, we determine what we support, then we eliminate unsupported interfaces, and finally we set the appropriate link modes. There is still some cruft related to NA, but this can be removed in a future patch. Signed-off-by: Sean Anderson <sean.anderson@seco.com> Reviewed-by: Parshuram Thombare <pthombar@cadence.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20211112190400.1937855-1-sean.anderson@seco.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 15 Nov, 2021 9 commits
-
-
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski authored
Daniel Borkmann says: ==================== pull-request: bpf-next 2021-11-15 We've added 72 non-merge commits during the last 13 day(s) which contain a total of 171 files changed, 2728 insertions(+), 1143 deletions(-). The main changes are: 1) Add btf_type_tag attributes to bring kernel annotations like __user/__rcu to BTF such that BPF verifier will be able to detect misuse, from Yonghong Song. 2) Big batch of libbpf improvements including various fixes, future proofing APIs, and adding a unified, OPTS-based bpf_prog_load() low-level API, from Andrii Nakryiko. 3) Add ingress_ifindex to BPF_SK_LOOKUP program type for selectively applying the programmable socket lookup logic to packets from a given netdev, from Mark Pashmfouroush. 4) Remove the 128M upper JIT limit for BPF programs on arm64 and add selftest to ensure exception handling still works, from Russell King and Alan Maguire. 5) Add a new bpf_find_vma() helper for tracing to map an address to the backing file such as shared library, from Song Liu. 6) Batch of various misc fixes to bpftool, fixing a memory leak in BPF program dump, updating documentation and bash-completion among others, from Quentin Monnet. 7) Deprecate libbpf bpf_program__get_prog_info_linear() API and migrate its users as the API is heavily tailored around perf and is non-generic, from Dave Marchevsky. 8) Enable libbpf's strict mode by default in bpftool and add a --legacy option as an opt-out for more relaxed BPF program requirements, from Stanislav Fomichev. 9) Fix bpftool to use libbpf_get_error() to check for errors, from Hengqi Chen. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits) bpftool: Use libbpf_get_error() to check error bpftool: Fix mixed indentation in documentation bpftool: Update the lists of names for maps and prog-attach types bpftool: Fix indent in option lists in the documentation bpftool: Remove inclusion of utilities.mak from Makefiles bpftool: Fix memory leak in prog_dump() selftests/bpf: Fix a tautological-constant-out-of-range-compare compiler warning selftests/bpf: Fix an unused-but-set-variable compiler warning bpf: Introduce btf_tracing_ids bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs bpftool: Enable libbpf's strict mode by default docs/bpf: Update documentation for BTF_KIND_TYPE_TAG support selftests/bpf: Clarify llvm dependency with btf_tag selftest selftests/bpf: Add a C test for btf_type_tag selftests/bpf: Rename progs/tag.c to progs/btf_decl_tag.c selftests/bpf: Test BTF_KIND_DECL_TAG for deduplication selftests/bpf: Add BTF_KIND_TYPE_TAG unit tests selftests/bpf: Test libbpf API function btf__add_type_tag() bpftool: Support BTF_KIND_TYPE_TAG libbpf: Support BTF_KIND_TYPE_TAG ... ==================== Link: https://lore.kernel.org/r/20211115162008.25916-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
This reverts commit 71812af7, reversing changes made to cc0be1ad. Wolfram Sang says: Please revert. Besides the driver in net, it modifies the I2C core code. This has not been acked by the I2C maintainer (in this case me). So, please don't pull this in via the net tree. The question raised here (extending SMBus calls to 255 byte) is complicated because we need ABI backwards compatibility. Link: https://lore.kernel.org/all/YZJ9H4eM%2FM7OXVN0@shikoro/Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
David S. Miller authored
Russell King says: ==================== introduce generic phylink validation The various validate method implementations we have in phylink users have been quite repetitive but also prone to bugs. These patches introduce a generic implementation which relies solely on the supported_interfaces bitmap introduced during last cycle, and in the first patch, a bit array of MAC capabilities. MAC drivers are free to continue to do their own thing if they have special requirements - such as mvneta and mvpp2 which do not support 1000base-X without AN enabled. Most implementations currently in the kernel can be converted to call phylink_generic_validate() directly from the phylink MAC operations structure once they fill in the supported_interfaces and mac_capabilities members of phylink_config. This series introduces the generic implementation, and converts mvneta and mvpp2 to use it. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Russell King (Oracle) authored
Convert mvpp2 to use phylink_generic_validate() for the bulk of its validate() implementation. This network adapter has a restriction that for 802.3z links, autonegotiation must be enabled. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Russell King (Oracle) authored
Convert mvneta to use phylink_generic_validate() for the bulk of its validate() implementation. This network adapter has a restriction that for 802.3z links, autonegotiation must be enabled. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Russell King (Oracle) authored
Add a generic validate() implementation using the supported_interfaces and a bitmask of MAC pause/speed/duplex capabilities. This allows us to entirely eliminate many driver private validate() implementations. We expose the underlying phylink_get_linkmodes() function so that drivers which have special needs can still benefit from conversion. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christophe Leroy authored
CHECK drivers/net/wan/fsl_ucc_hdlc.c drivers/net/wan/fsl_ucc_hdlc.c:309:57: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:309:57: expected void [noderef] __iomem * drivers/net/wan/fsl_ucc_hdlc.c:309:57: got restricted __be16 * drivers/net/wan/fsl_ucc_hdlc.c:311:46: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:311:46: expected void [noderef] __iomem * drivers/net/wan/fsl_ucc_hdlc.c:311:46: got restricted __be32 * drivers/net/wan/fsl_ucc_hdlc.c:320:57: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:320:57: expected void [noderef] __iomem * drivers/net/wan/fsl_ucc_hdlc.c:320:57: got restricted __be16 * drivers/net/wan/fsl_ucc_hdlc.c:322:46: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:322:46: expected void [noderef] __iomem * drivers/net/wan/fsl_ucc_hdlc.c:322:46: got restricted __be32 * drivers/net/wan/fsl_ucc_hdlc.c:372:29: warning: incorrect type in assignment (different base types) drivers/net/wan/fsl_ucc_hdlc.c:372:29: expected unsigned short [usertype] drivers/net/wan/fsl_ucc_hdlc.c:372:29: got restricted __be16 [usertype] drivers/net/wan/fsl_ucc_hdlc.c:379:36: warning: restricted __be16 degrades to integer drivers/net/wan/fsl_ucc_hdlc.c:402:12: warning: incorrect type in assignment (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:402:12: expected struct qe_bd [noderef] __iomem *bd drivers/net/wan/fsl_ucc_hdlc.c:402:12: got struct qe_bd *curtx_bd drivers/net/wan/fsl_ucc_hdlc.c:425:20: warning: incorrect type in assignment (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:425:20: expected struct qe_bd [noderef] __iomem *[assigned] bd drivers/net/wan/fsl_ucc_hdlc.c:425:20: got struct qe_bd *tx_bd_base drivers/net/wan/fsl_ucc_hdlc.c:427:16: error: incompatible types in comparison expression (different address spaces): drivers/net/wan/fsl_ucc_hdlc.c:427:16: struct qe_bd [noderef] __iomem * drivers/net/wan/fsl_ucc_hdlc.c:427:16: struct qe_bd * drivers/net/wan/fsl_ucc_hdlc.c:462:33: warning: incorrect type in argument 1 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:506:41: warning: incorrect type in argument 1 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:528:33: warning: incorrect type in argument 1 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:552:38: warning: incorrect type in argument 1 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:596:67: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:611:41: warning: incorrect type in argument 1 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:851:38: warning: incorrect type in initializer (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:854:40: warning: incorrect type in argument 1 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:855:40: warning: incorrect type in argument 1 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:858:39: warning: incorrect type in argument 1 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:861:37: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:866:38: warning: incorrect type in initializer (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:868:21: warning: incorrect type in argument 1 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:870:40: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:871:40: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:873:39: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:993:57: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:995:46: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:1004:57: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:1006:46: warning: incorrect type in argument 2 (different address spaces) drivers/net/wan/fsl_ucc_hdlc.c:412:35: warning: dereference of noderef expression drivers/net/wan/fsl_ucc_hdlc.c:412:35: warning: dereference of noderef expression drivers/net/wan/fsl_ucc_hdlc.c:724:29: warning: dereference of noderef expression drivers/net/wan/fsl_ucc_hdlc.c:815:21: warning: dereference of noderef expression drivers/net/wan/fsl_ucc_hdlc.c:1021:29: warning: dereference of noderef expression Most of the warnings are due to DMA memory being incorrectly handled as IO memory. Fix it by doing direct read/write and doing proper dma_rmb() / dma_wmb(). Other problems are type mismatches or lack of use of IO accessors. Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.org/lkml/2021/11/12/647Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yihao Han authored
Use the macro 'swap()' defined in 'include/linux/minmax.h' to avoid opencoding it. Signed-off-by: Yihao Han <hanyihao@vivo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guo Zhengkui authored
ARRAY_SIZE defined in <linux/kernel.h> is safer than self-defined macros to get size of an array such as ARRAY_LEN used here. Because ARRAY_SIZE uses __must_be_array(arr) to ensure arr is really an array. Reported-by: Alejandro Colomar <colomar.6.4.3@gmail.com> Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-