- 16 May, 2022 40 commits
-
Ricardo Martinez authored
skb_data_area_size() is not needed. As Jakub pointed out [1]:
For Rx, drivers can use the size passed during skb allocation or use skb_tailroom().
For Tx, drivers should use skb_headlen().

[1] https://lore.kernel.org/netdev/CAHNKnsTmH-rGgWi3jtyC=ktM1DW2W1VJkYoTMJV2Z_Bt498bsg@mail.gmail.com/

Signed-off-by: Ricardo Martinez <ricardo.martinez@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
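For context, a minimal sketch of the two sizing rules Jakub describes, using the existing skb helpers; the function and its use are illustrative, not taken from the patch:

#include <linux/skbuff.h>

static void sizing_rules_sketch(struct sk_buff *skb)
{
        /* Tx: length of the linear data area that must be DMA-mapped. */
        unsigned int tx_len = skb_headlen(skb);

        /* Rx: room available past skb->tail for the HW to write into. */
        unsigned int rx_room = skb_tailroom(skb);

        (void)tx_len;
        (void)rx_room;
}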
-
Ricardo Martinez authored
The skb_data_area_size() helper was used to calculate the size of the DMA-mapped buffer passed to the HW. Instead of doing this, use the size passed to allocate the skbs.

Signed-off-by: Ricardo Martinez <ricardo.martinez@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Mat Martineau says:

====================
mptcp: Updates for net-next

Three independent fixes/features from the MPTCP tree:

Patch 1 is a selftest workaround for older iproute2 packages.

Patch 2 removes superfluous locks that were added with recent MP_FAIL patches.

Patch 3 adds support for the TCP_DEFER_ACCEPT sockopt.
====================

Link: https://lore.kernel.org/r/20220514002115.725976-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Florian Westphal authored
Add support for the TCP_DEFER_ACCEPT sockopt via passthrough to the underlying tcp listener socket.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/271
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
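For illustration, a small userspace sketch of what this enables; the port number and the IPPROTO_MPTCP fallback define are assumptions, not part of the patch:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262
#endif

int main(void)
{
        int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
        int secs = 5; /* wait up to ~5s for data before waking accept() */

        if (fd < 0) {
                perror("socket");
                return 1;
        }
        /* With this patch, the option is forwarded to the underlying TCP
         * listener socket instead of being rejected. */
        if (setsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT, &secs, sizeof(secs)) < 0)
                perror("setsockopt(TCP_DEFER_ACCEPT)");
        return 0;
}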
-
Paolo Abeni authored
This reverts commit 4293248c ("mptcp: add data lock for sk timers").

Additional locks are not needed, all the touched sections are already under mptcp socket lock protection.

Fixes: 4293248c ("mptcp: add data lock for sk timers")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Geliang Tang authored
Old tc versions (iproute2 5.3) show actions in multiple lines, not a single line. Then the following unexpected MP_FAIL selftest output occurs:

  file received by server has inverted byte at 169
  ./mptcp_join.sh: line 1277: [: [{"total acts":1},{"actions":[{"order":0 pedit ,"control_action":{"type":"pipe"}keys 1 index 1 ref 1 bind 1,"installed":0,"last_used":0 key #0 at 148: val ff000000 mask ffffffff
  5: integer expression expected
  001 Infinite map              syn[ ok ] - synack[ ok ] - ack[ ok ]
                                sum[ ok ] - csum  [ ok ]
                                ftx[ ok ] - failrx[ ok ]
                                rtx[ ok ] - rstrx [ ok ]
                                itx[ ok ] - infirx[ ok ]
                                ftx[ ok ] - failrx[ ok ] invert

This patch adds a 'grep' before 'sed' to fix this.

Fixes: b6e074e1 ("selftests: mptcp: add infinite map testcase")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Alvin Šipraga authored
Lock the regmap during the whole PHY register access routines in rtl8366rb.

Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Tested-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/20220513213618.2742895-1-linus.walleij@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Fabio Estevam authored
Now that it is possible to use .probe without having .driver_data, let KSZ8061 use the kszphy-specific hooks for probe, suspend and resume, which is preferred. Switch to using the dedicated kszphy probe/suspend/resume functions.

Signed-off-by: Fabio Estevam <festevam@denx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220513114613.762810-2-festevam@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Fabio Estevam authored
Currently, if the .probe element is present in the phy_driver structure and the .driver_data is not, a NULL pointer dereference happens. Allow passing .probe without .driver_data by inserting NULL checks for priv->type.

Signed-off-by: Fabio Estevam <festevam@denx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220513114613.762810-1-festevam@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
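A hedged sketch of the guard pattern this describes; the _sketch types are minimal stand-ins for the driver-local structures in drivers/net/phy/micrel.c, shown only to make the shape of the check self-contained:

#include <linux/phy.h>

struct kszphy_type_sketch { u32 led_mode_reg; };
struct kszphy_priv_sketch {
        const struct kszphy_type_sketch *type;
        int led_mode;
};

static int kszphy_config_init_sketch(struct phy_device *phydev)
{
        struct kszphy_priv_sketch *priv = phydev->priv;

        /* Before the fix, priv->type was dereferenced unconditionally,
         * so a driver entry with .probe but no .driver_data crashed
         * here. Treating a NULL type as valid avoids that. */
        if (priv->type && priv->type->led_mode_reg)
                /* ... set up LEDs via priv->type->led_mode_reg ... */
                ;

        return 0;
}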
-
Minghao Chi authored
Calling synchronize_irq() right before free_irq() is quite useless. On one hand the IRQ can easily fire again before free_irq() is entered, on the other hand free_irq() itself calls synchronize_irq() internally (in a race condition free way), before any state associated with the IRQ is freed.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
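A minimal before/after sketch of the pattern being removed; the function and its arguments are illustrative:

#include <linux/interrupt.h>

static void irq_teardown_sketch(unsigned int irq, void *dev_id)
{
        /* Before:
         *      synchronize_irq(irq);
         *      free_irq(irq, dev_id);
         *
         * After: free_irq() already waits for in-flight handlers, so
         * the explicit synchronize_irq() adds nothing. */
        free_irq(irq, dev_id);
}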
-
YueHaibing authored
t7xx_dl_add_timedout() now returns an int 'ret', but the return type is bool. Change the return type to int so the error code can be passed further up the call chain.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ziyang Xuan authored
vfree(NULL) is safe. NULL checks before vfree() are not needed. Delete them to simplify the code.

Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
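The simplification at a glance, as a sketch with an illustrative buffer name:

#include <linux/vmalloc.h>

static void buf_release_sketch(void *buf)
{
        /* Before:
         *      if (buf)
         *              vfree(buf);
         *
         * After: vfree() accepts NULL, so call it unconditionally. */
        vfree(buf);
}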
-
Zheng Bin authored
octep_init_module() misses a destroy_workqueue() in its error path; this patch fixes that.

Fixes: 862cd659 ("octeon_ep: Add driver framework and device initialization")
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
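A hedged sketch of the unwind pattern such a fix adds; the workqueue name and the failing call are stand-ins, not the driver's actual code:

#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *octep_wq_sketch;

static int __init octep_init_sketch(void)
{
        int ret;

        octep_wq_sketch = create_singlethread_workqueue("octep_wq_sketch");
        if (!octep_wq_sketch)
                return -ENOMEM;

        ret = -ENODEV; /* stand-in for a failing pci_register_driver() */
        if (ret) {
                destroy_workqueue(octep_wq_sketch); /* the missing cleanup */
                return ret;
        }
        return 0;
}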
-
David S. Miller authored
Eric Dumazet says:

====================
net: polish skb defer freeing

While testing this recently added feature on a variety of platforms/configurations, I found the following issues:

1) A race leading to concurrent calls to smp_call_function_single_async()

2) Missed opportunity to use napi_consume_skb()

3) Need to limit the max length of the per-cpu lists.

4) Process the per-cpu list more frequently, for the (unusual) case where net_rx_action() has multiple napi_poll() to process per round.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
skb_defer_free_flush() can consume cpu cycles, it seems better to call it in the inner loop:

- Potentially frees page/skb that will be reallocated while hot.

- Account for the cpu cycles in the @time_limit determination.

- Keep softnet_data.defer_count small to reduce chances for skb_attempt_defer_free() to send an IPI.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
commit 68822bdf ("net: generalize skb freeing deferral to per-cpu lists") added another per-cpu cache of skbs. It was expected to be small, and an IPI was forced whenever the list reached 128 skbs.

We might need to be able to control more precisely queue capacity and added latency. An IPI is generated whenever the queue reaches half capacity. Default value of the new limit is 64.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
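A hedged sketch of the resulting queueing policy, simplified from what skb_attempt_defer_free() would do (locking elided; treat the softnet_data field usage as illustrative of the series, not a verbatim excerpt):

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/smp.h>

static void defer_policy_sketch(struct softnet_data *sd, struct sk_buff *skb,
                                unsigned int defer_max, int cpu)
{
        if (READ_ONCE(sd->defer_count) >= defer_max) {
                __kfree_skb(skb);       /* queue full: free inline, no deferral */
                return;
        }
        skb->next = sd->defer_list;
        sd->defer_list = skb;
        sd->defer_count++;

        /* Wake the cpu that owns the list once it is half full. */
        if (sd->defer_count == defer_max >> 1)
                smp_call_function_single_async(cpu, &sd->defer_csd);
}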
-
Eric Dumazet authored
skb_defer_free_flush() runs from softirq context, we have the opportunity to refill the napi_alloc_cache, and/or use kmem_cache_free_bulk() when this cache is full.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
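In sketch form, the flush loop can hand each skb to napi_consume_skb(), which refills the per-cpu napi_alloc_cache and falls back to bulk freeing when that cache is full; this is a simplified rendering, not the exact patch:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static void defer_flush_sketch(struct softnet_data *sd)
{
        struct sk_buff *skb, *next;

        skb = sd->defer_list;   /* the real code detaches under sd->defer_lock */
        sd->defer_list = NULL;
        sd->defer_count = 0;

        while (skb) {
                next = skb->next;
                /* non-zero budget: allow recycling into napi_alloc_cache */
                napi_consume_skb(skb, 1);
                skb = next;
        }
}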
-
Eric Dumazet authored
A cpu can observe sd->defer_count reaching 128 and call smp_call_function_single_async().

Problem is that the remote CPU can clear sd->defer_count before the IPI is run/acknowledged. Other cpus can queue more packets and also decide to call smp_call_function_single_async() while the pending IPI was not yet delivered.

This is a common issue with smp_call_function_single_async(): callers must ensure correct synchronization and serialization. I triggered this issue while experimenting with a smaller threshold.

Performing the call to smp_call_function_single_async() under sd->defer_lock protection did not solve the problem. Commit 5a18ceca ("smp: Allow smp_call_function_single_async() to insert locked csd") replaced an informative WARN_ON_ONCE() with a return of -EBUSY, which is often ignored. Testing for CSD_FLAG_LOCK presence is racy anyway.

Fixes: 68822bdf ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
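A hedged sketch of the serialization such a fix relies on: a per-cpu flag ensures only one IPI is in flight, and the IPI handler re-arms it. The defer_ipi_scheduled field name follows the series; treat the details as illustrative:

#include <linux/netdevice.h>
#include <linux/smp.h>

static void kick_defer_purge_sketch(struct softnet_data *sd, unsigned int cpu)
{
        /* Only the winner of the cmpxchg sends the IPI; everyone else
         * knows one is already pending. */
        if (!cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
                smp_call_function_single_async(cpu, &sd->defer_csd);
}

static void defer_ipi_handler_sketch(struct softnet_data *sd)
{
        /* Re-arm before flushing so a concurrent producer can schedule
         * the next IPI without losing a wakeup. */
        smp_store_release(&sd->defer_ipi_scheduled, 0);
        /* ... raise NET_RX softirq / flush the defer list ... */
}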
-
Rolf Eike Beer authored
Works fine on my HP C3600:

[  274.452394] tulip0: no phy info, aborting mtable build
[  274.499041] tulip0: MII transceiver #1 config 1000 status 782d advertising 01e1
[  274.750691] net eth0: Digital DS21142/43 Tulip rev 65 at MMIO 0xf4008000, 00:30:6e:08:7d:21, IRQ 17
[  283.104520] net eth0: Setting full-duplex based on MII#1 link partner capability of c1e1

Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Zheng Bin authored
hinic_pf_to_mgmt_init() misses a destroy_workqueue() in its error path; this patch fixes that.

Fixes: 6dbb8901 ("hinic: fix sending mailbox timeout in aeq event work")
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Menglong Dong says:

====================
net: skb: check the boundary of skb drop reason

In commit 1330b6ef ("skb: make drop reason booleanable"), SKB_NOT_DROPPED_YET was added to the enum skb_drop_reason, which makes it possible for the invalid drop reason SKB_NOT_DROPPED_YET to leak to the kfree_skb tracepoint. Once this happens (and it did happen, as the 4th patch says), it can cause a NULL pointer dereference in the drop monitor and result in a kernel panic.

Therefore, check the boundary of the drop reason in both kfree_skb_reason (2nd patch) and the drop monitor (1st patch) to prevent such a case from happening again. Meanwhile, fix the invalid drop reason passed to kfree_skb_reason() in tcp_v4_rcv() and tcp_v6_rcv().

Changes since v2:
1/4 - don't reset the reason and print the debug warning only (Jakub Kicinski)
4/4 - remove new lines between tags

Changes since v1:
- consider tcp_v6_rcv() in the 4th patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Menglong Dong authored
The 'drop_reason' passed to kfree_skb_reason() in tcp_v4_rcv() and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET(0), as it is used as the return value of tcp_inbound_md5_hash(). A reason of 0 can panic the kernel with a NULL pointer dereference in net_dm_packet_report_size(), as drop_reasons[0] is NULL.

Fixes: 1330b6ef ("skb: make drop reason booleanable")
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Menglong Dong authored
SKB_DR_OR() is used to set the drop reason to a value when it is not set or specified yet. SKB_NOT_DROPPED_YET should also be considered as not set.

Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
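A hedged sketch of the semantics after this change; the exact macro text in the tree may differ:

/* SKB_DR_OR() only overwrites a reason that is still "unset"; both
 * NOT_SPECIFIED and SKB_NOT_DROPPED_YET now count as unset. */
#define SKB_DR_SET(name, reason) ((name) = SKB_DROP_REASON_##reason)
#define SKB_DR_OR(name, reason)                                 \
        do {                                                    \
                if ((name) == SKB_DROP_REASON_NOT_SPECIFIED ||  \
                    (name) == SKB_NOT_DROPPED_YET)              \
                        SKB_DR_SET(name, reason);               \
        } while (0)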
-
Menglong Dong authored
Sometimes we may forget to reset the skb drop reason to NOT_SPECIFIED after using it as the return value of a function with return type enum skb_drop_reason, such as tcp_inbound_md5_hash(). Its value can then be SKB_NOT_DROPPED_YET(0), which is invalid for kfree_skb_reason(). So we check the range of the drop reason in kfree_skb_reason() with DEBUG_NET_WARN_ON_ONCE().

Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
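As a sketch, the check is debug-only and deliberately does not rewrite the reason, per the v2 feedback quoted in the cover letter (details illustrative):

#include <linux/skbuff.h>

static void kfree_skb_reason_check_sketch(struct sk_buff *skb,
                                          enum skb_drop_reason reason)
{
        /* Warn on debug builds if a caller passes an out-of-range reason;
         * 0 is SKB_NOT_DROPPED_YET and must never reach this point. */
        DEBUG_NET_WARN_ON_ONCE(reason <= 0 || reason >= SKB_DROP_REASON_MAX);

        /* ... trace_kfree_skb() and the actual free follow ... */
}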
-
Menglong Dong authored
In net_dm_packet_trace_kfree_skb_hit(), 'reason' is reset to SKB_DROP_REASON_NOT_SPECIFIED if it is not smaller than SKB_DROP_REASON_MAX, but that check does not prevent it from being 0, which is invalid and can cause a NULL pointer dereference in drop_reasons. Therefore, reset it to SKB_DROP_REASON_NOT_SPECIFIED when 'reason <= 0'.

Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guangguan Wang authored
A connect() with O_NONBLOCK does not complete immediately and returns -EINPROGRESS. The caller can then use select()/poll() for completion by selecting the socket for writing. After select() indicates writability, a second connect() call should return 0 to indicate success, as TCP does, but smc returns -EISCONN. Use the socket state in smc to track the connect state, which aligns smc's connect behaviour with TCP.

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
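The userspace sequence in question, as a hedged sketch; the AF_SMC fallback define and the peer address are assumptions for illustration:

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef AF_SMC
#define AF_SMC 43
#endif

int main(void)
{
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(12345) };
        struct pollfd pfd;
        int fd = socket(AF_SMC, SOCK_STREAM, 0);

        inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);
        fcntl(fd, F_SETFL, O_NONBLOCK);

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 &&
            errno == EINPROGRESS) {
                pfd.fd = fd;
                pfd.events = POLLOUT;
                poll(&pfd, 1, 5000); /* wait until the socket is writable */
                /* With this patch, the second connect() returns 0 on
                 * success, as TCP does, rather than -1 with EISCONN. */
                if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
                        printf("connected\n");
        }
        return 0;
}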
-
David S. Miller authored
Eric Dumazet says:

====================
net: add annotations for sk->sk_bound_dev_if

While writes on sk->sk_bound_dev_if are protected by the socket lock, we have many lockless reads all over the place.

This is based on a syzbot report found in the first patch changelog.

v2: inline ipv6 function only defined if IS_ENABLED(CONFIG_IPV6) (kernel bots)
    Change INET6_MATCH() to inet6_match(), this is no longer a macro.
    Change INET_MATCH() to inet_match() (Olivier Hartkopp & Jakub Kicinski)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
This is no longer a macro, but an inlined function.

INET_MATCH() -> inet_match()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Olivier Hartkopp <socketcan@hartkopp.net>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
INET6_MATCH() runs without holding a lock on the socket. We probably need to annotate most reads.

This patch makes INET6_MATCH() an inline function to ease our changes.

v2: inline function only defined if IS_ENABLED(CONFIG_IPV6)
    Change the name to inet6_match(), this is no longer a macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Use READ_ONCE() in paths not holding the socket lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
sk->sk_bound_dev_if can change under us, use READ_ONCE() annotation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
inet_csk_bind_conflict() can access sk->sk_bound_dev_if for unlocked sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
When reading a listener's sk->sk_bound_dev_if locklessly, we must use READ_ONCE().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val), because other cpus/threads might locklessly read this field.

sock_getbindtodevice() and sock_getsockopt() need READ_ONCE() because they run without the socket lock held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
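The annotation pattern used throughout this series, as a minimal sketch (function names illustrative):

#include <net/sock.h>

/* Writer: runs under the socket lock, but readers may race with it. */
static void bound_dev_write_sketch(struct sock *sk, int val)
{
        WRITE_ONCE(sk->sk_bound_dev_if, val);
}

/* Reader: runs without the socket lock, so annotate the load. */
static int bound_dev_read_sketch(const struct sock *sk)
{
        return READ_ONCE(sk->sk_bound_dev_if);
}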
-
Eric Dumazet authored
inet_request_bound_dev_if() reads sk->sk_bound_dev_if twice while the listener socket is not locked. Another cpu could change this field under us.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
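A hedged sketch of the fix's shape: load the field once into a local so both uses observe the same snapshot (the l3mdev branch is simplified away):

#include <net/sock.h>

static int request_bound_dev_if_sketch(const struct sock *sk, int skb_iif)
{
        /* One annotated load; every later use sees the same value. */
        int bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);

        if (!bound_dev_if)
                return skb_iif; /* simplified stand-in for the l3mdev lookup */
        return bound_dev_if;
}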
-
Eric Dumazet authored
sctp_rcv() reads sk->sk_bound_dev_if twice while the socket is not locked. Another cpu could change this field under us.

Fixes: 0fd9a65a ("[SCTP] Support SO_BINDTODEVICE socket option on incoming packets.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
UDP sendmsg() is lockless, and reads sk->sk_bound_dev_if while this field can be changed by another thread.

Adds minimal annotations to avoid KCSAN splats for UDP. Following patches will add more annotations to potential lockless readers.

BUG: KCSAN: data-race in __ip6_datagram_connect / udpv6_sendmsg

write to 0xffff888136d47a94 of 4 bytes by task 7681 on cpu 0:
 __ip6_datagram_connect+0x6e2/0x930 net/ipv6/datagram.c:221
 ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
 inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576
 __sys_connect_file net/socket.c:1900 [inline]
 __sys_connect+0x197/0x1b0 net/socket.c:1917
 __do_sys_connect net/socket.c:1927 [inline]
 __se_sys_connect net/socket.c:1924 [inline]
 __x64_sys_connect+0x3d/0x50 net/socket.c:1924
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff888136d47a94 of 4 bytes by task 7670 on cpu 1:
 udpv6_sendmsg+0xc60/0x16e0 net/ipv6/udp.c:1436
 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:652
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg net/socket.c:725 [inline]
 ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
 ___sys_sendmsg net/socket.c:2467 [inline]
 __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
 __do_sys_sendmmsg net/socket.c:2582 [inline]
 __se_sys_sendmmsg net/socket.c:2579 [inline]
 __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x00000000 -> 0xffffff9b

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7670 Comm: syz-executor.3 Tainted: G        W         5.18.0-rc1-syzkaller-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

I chose to not add a Fixes: tag because the race has minor consequences and stable teams are busy enough.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Eric Dumazet says:

====================
tcp: BIG TCP implementation

This series implements BIG TCP as presented in netdev 0x15:
https://netdevconf.info/0x15/session.html?BIG-TCP

Jonathan Corbet made a nice summary: https://lwn.net/Articles/884104/

Standard TSO/GRO packet limit is 64KB

With BIG TCP, we allow bigger TSO/GRO packet sizes for IPv6 traffic.

Note that this feature is by default not enabled, because it might break some eBPF programs assuming TCP header immediately follows IPv6 header.

While tcpdump recognizes the HBH/Jumbo header, standard pcap filters are unable to skip over IPv6 extension headers.

Reducing number of packets traversing networking stack usually improves performance, as shown on this experiment using a 100Gbit NIC, and 4K MTU.

'Standard' performance with current (74KB) limits.

for i in {1..10}; do ./netperf -t TCP_RR -H iroa23 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
77   138   183   8542.19
79   143   178   8215.28
70   117   164   9543.39
80   144   176   8183.71
78   126   155   9108.47
80   146   184   8115.19
71   113   165   9510.96
74   113   164   9518.74
79   137   178   8575.04
73   111   171   9561.73

Now enable BIG TCP on both hosts.

ip link set dev eth0 gro_max_size 185000 gso_max_size 185000
for i in {1..10}; do ./netperf -t TCP_RR -H iroa23 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
57   83    117   13871.38
64   118   155   11432.94
65   116   148   11507.62
60   105   136   12645.15
60   103   135   12760.34
60   102   134   12832.64
62   109   132   10877.68
58   82    115   14052.93
57   83    124   14212.58
57   82    119   14196.01

We see an increase of transactions per second, and lower latencies as well.

v7: adopt unsafe_memcpy() in mlx5 to avoid FORTIFY warnings.

v6: fix a compilation error for CONFIG_IPV6=n in "net: allow gso_max_size to exceed 65536", reported by kernel bots.

v5: Replaced two patches (that were adding new attributes) with patches from Alexander Duyck. Idea is to reuse existing gso_max_size/gro_max_size

v4: Rebased on top of Jakub series (Merge branch 'tso-gso-limit-split'). max_tso_size is now family independent.

v3: Fixed a typo in RFC number (Alexander). Added Reviewed-by: tags from Tariq on mlx4/mlx5 parts.

v2: Removed the MAX_SKB_FRAGS change, this belongs to a different series. Addressed feedback, for Alexander and nvidia folks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
mlx5 supports LSOv2.

IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header with JUMBO TLV for big packets.

We need to ignore/skip this HBH header when populating TX descriptor.

Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.

v7: adopt unsafe_memcpy() and MLX5_UNSAFE_MEMCPY_DISCLAIMER
v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
mlx4 supports LSOv2 just fine.

IPv6 stack inserts a temporary Hop-by-Hop header with JUMBO TLV for big packets.

We need to ignore the HBH header when populating TX descriptor.

Tested:

Before: (not enabling bigger TSO/GRO packets)

ip link set dev eth0 gso_max_size 65536 gro_max_size 65536

netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

262144 540000 70000   70000  10.00   6591.45  0.86   1.34   62.490  97.446
262144 540000

After: (enabling bigger TSO/GRO packets)

ip link set dev eth0 gso_max_size 185000 gro_max_size 185000

netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

262144 540000 70000   70000  10.00   8383.95  0.95   1.01   54.432  57.584
262144 540000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-