Commits · b2f750c3a80b285cd60c9346f8c96bd0a2a66cde · Kirill Smelkov / linux

11 Oct, 2023 3 commits

net: rfkill: gpio: prevent value glitch during probe · b2f750c3

Josua Mayer authored Oct 04, 2023

When either the reset- or shutdown-gpio is initially deasserted,
e.g. after a reboot - or when the hardware does not include pull-down,
there will be a short toggle of both IOs to logical 0 and back to 1.

It seems that the rfkill default is unblocked, so the driver should not
glitch to output low during probe.
It can lead, e.g., to an unexpected LTE modem reconnect:

[1] root@localhost:~# dmesg | grep "usb 2-1"
[    2.136124] usb 2-1: new SuperSpeed USB device number 2 using xhci-hcd
[   21.215278] usb 2-1: USB disconnect, device number 2
[   28.833977] usb 2-1: new SuperSpeed USB device number 3 using xhci-hcd

The glitch has been discovered on an arm64 board, now that device-tree
support for the rfkill-gpio driver has finally appeared :).

Change the flags for devm_gpiod_get_optional from GPIOD_OUT_LOW to
GPIOD_ASIS to avoid any glitches.
The rfkill driver will set the intended value during rfkill_sync_work.
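
A minimal user-space sketch of why the request flag matters, assuming (as the message above describes) that the unblocked state corresponds to logical 1. The `struct line` type and the `request_line()` and `probe_and_sync()` helpers are hypothetical stand-ins, not the kernel's gpiod API: they only model that an "output low" request drives the pin at once, while an "as is" request leaves the level alone until the later sync sets it.

```c
#include <stdbool.h>

/* Hypothetical model of one GPIO line; not the kernel's struct gpio_desc. */
enum req_flags { REQ_ASIS, REQ_OUT_LOW };

struct line {
	bool level;        /* current logical level */
	int  transitions;  /* how many times the level changed */
};

static void line_set(struct line *l, bool level)
{
	if (l->level != level) {
		l->level = level;
		l->transitions++;
	}
}

/* Models requesting the line during probe. */
static void request_line(struct line *l, enum req_flags flags)
{
	if (flags == REQ_OUT_LOW)
		line_set(l, false);   /* drives the line low right away */
	/* REQ_ASIS: leave the level untouched */
}

/* Models probe followed by the later sync that applies "unblocked" (1). */
static int probe_and_sync(bool initial_level, enum req_flags flags)
{
	struct line l = { .level = initial_level, .transitions = 0 };

	request_line(&l, flags);
	line_set(&l, true);
	return l.transitions;
}
```

In this model a line that starts at 1 sees two transitions (the glitch) with `REQ_OUT_LOW` and none with `REQ_ASIS`.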

Fixes: 7176ba23 ("net: rfkill: add generic gpio rfkill driver")
Signed-off-by: Josua Mayer <josua@solid-run.com>
Link: https://lore.kernel.org/r/20231004163928.14609-1-josua@solid-run.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

wifi: mac80211: fix error path key leak · 02e0e426

Johannes Berg authored Oct 05, 2023

In the previous key leak fix for the other error
paths, I meant to unify all of them to the same
place, but used the wrong label, which I noticed
when doing the merge into wireless-next. Fix it.

Fixes: d097ae01 ("wifi: mac80211: fix potential key leak")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

wifi: cfg80211: use system_unbound_wq for wiphy work · 91d20ab9

Johannes Berg authored Oct 09, 2023

Since wiphy work items can run pretty much arbitrary
code in the stack/driver, it can take longer to run
all of this, so we shouldn't be using system_wq via
schedule_work(). Also, we lock the wiphy (which is
the reason this exists), so use system_unbound_wq.

Reported-and-tested-by: Kalle Valo <kvalo@kernel.org>
Fixes: a3ee4dc8 ("wifi: cfg80211: add a work abstraction with special semantics")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

05 Oct, 2023 20 commits

Merge tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · f291209e

Linus Torvalds authored Oct 05, 2023

Pull networking fixes from Jakub Kicinski:
 "Including fixes from Bluetooth, netfilter, BPF and WiFi.

  I didn't collect precise data but feels like we've got a lot of 6.5
  fixes here. WiFi fixes are most user-awaited.

  Current release - regressions:

   - Bluetooth: fix hci_link_tx_to RCU lock usage

  Current release - new code bugs:

   - bpf: mprog: fix maximum program check on mprog attachment

   - eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()

  Previous releases - regressions:

   - ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling

   - vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(), it
     doesn't handle zero length like we expected

   - wifi:
      - cfg80211: fix cqm_config access race, fix crashes with brcmfmac
      - iwlwifi: mvm: handle PS changes in vif_cfg_changed
      - mac80211: fix mesh id corruption on 32 bit systems
      - mt76: mt76x02: fix MT76x0 external LNA gain handling

   - Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER

   - l2tp: fix handling of transhdrlen in __ip{,6}_append_data()

   - dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent

   - eth: stmmac: fix the incorrect parameter after refactoring

  Previous releases - always broken:

   - net: replace calls to sock->ops->connect() with kernel_connect(),
     prevent address rewrite in kernel_bind(); otherwise BPF hooks may
     modify arguments, unexpectedly to the caller

   - tcp: fix delayed ACKs when reads and writes align with MSS

   - bpf:
      - verifier: unconditionally reset backtrack_state masks on global
        func exit
      - s390: let arch_prepare_bpf_trampoline return program size, fix
        struct_ops offsets
      - sockmap: fix accounting of available bytes in presence of PEEKs
      - sockmap: reject sk_msg egress redirects to non-TCP sockets

   - ipv4/fib: send netlink notify when delete source address routes

   - ethtool: plca: fix width of reads when parsing netlink commands

   - netfilter: nft_payload: rebuild vlan header on h_proto access

   - Bluetooth: hci_codec: fix leaking memory of local_codecs

   - eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids

   - eth: stmmac:
     - dwmac-stm32: fix resume on STM32 MCU
     - remove buggy and unneeded stmmac_poll_controller, depend on NAPI

   - ibmveth: always recompute TCP pseudo-header checksum, fix use of
     the driver with Open vSwitch

   - wifi:
      - rtw88: rtw8723d: fix MAC address offset in EEPROM
      - mt76: fix lock dependency problem for wed_lock
      - mwifiex: sanity check data reported by the device
      - iwlwifi: ensure ack flag is properly cleared
      - iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
      - iwlwifi: mvm: fix incorrect usage of scan API

  Misc:

   - wifi: mac80211: work around Cisco AP 9115 VHT MPDU length"

* tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
  MAINTAINERS: update Matthieu's email address
  mptcp: userspace pm allow creating id 0 subflow
  mptcp: fix delegated action races
  net: stmmac: remove unneeded stmmac_poll_controller
  net: lan743x: also select PHYLIB
  net: ethernet: mediatek: disable irq before schedule napi
  net: mana: Fix oversized sge0 for GSO packets
  net: mana: Fix the tso_bytes calculation
  net: mana: Fix TX CQE error handling
  netlink: annotate data-races around sk->sk_err
  sctp: update hb timer immediately after users change hb_interval
  sctp: update transport state when processing a dupcook packet
  tcp: fix delayed ACKs for MSS boundary condition
  tcp: fix quick-ack counting to count actual ACKs of new data
  page_pool: fix documentation typos
  tipc: fix a potential deadlock on &tx->lock
  net: stmmac: dwmac-stm32: fix resume on STM32 MCU
  ipv4: Set offload_failed flag in fibmatch results
  netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
  netfilter: nf_tables: Deduplicate nft_register_obj audit logs
  ...

Merge tag 'integrity-v6.6-fix' of... · cb84fb87

Linus Torvalds authored Oct 05, 2023

Merge tag 'integrity-v6.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity

Pull integrity fixes from Mimi Zohar:
 "Two additional patches to fix the removal of the deprecated
  IMA_TRUSTED_KEYRING Kconfig"

* tag 'integrity-v6.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  ima: rework CONFIG_IMA dependency block
  ima: Finish deprecation of IMA_TRUSTED_KEYRING Kconfig

Merge tag 'leds-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds · e90822d7

Linus Torvalds authored Oct 05, 2023

Pull LED fix from Lee Jones:
 "Just the one bug-fix:

   - Fix regression affecting LED_COLOR_ID_MULTI users"

* tag 'leds-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds:
  leds: Drop BUG_ON check for LED_COLOR_ID_MULTI

Merge tag 'mfd-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd · bc622f16

Linus Torvalds authored Oct 05, 2023

Pull MFD fixes from Lee Jones:
 "A couple of small fixes:

   - Potential build failure in CS42L43

   - Device Tree bindings clean-up for a superseded patch"

* tag 'mfd-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
  dt-bindings: mfd: Revert "dt-bindings: mfd: maxim,max77693: Add USB connector"
  mfd: cs42l43: Fix MFD_CS42L43 dependency on REGMAP_IRQ

Merge tag 'ovl-fixes-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs · 403688e0

Linus Torvalds authored Oct 05, 2023

Pull overlayfs fixes from Amir Goldstein:

 - Fix for file reference leak regression

 - Fix for NULL pointer deref regression

 - Fixes for RCU-walk race regressions:

   Two of the fixes were taken from Al's RCU pathwalk race fixes series
   with his consent [1].

   Note that unlike most of Al's series, these two patches are not about
   racing with ->kill_sb() and they are also very recent regressions
   from v6.5, so I think it's worth getting them into v6.5.y.

   There is also a fix for an RCU pathwalk race with ->kill_sb(), which
   may have been solved in vfs generic code as you suggested, but it
   also rids overlayfs from a nasty hack, so I think it's worth anyway.

Link: https://lore.kernel.org/linux-fsdevel/20231003204749.GA800259@ZenIV/ [1]

* tag 'ovl-fixes-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
  ovl: fix NULL pointer defer when encoding non-decodable lower fid
  ovl: make use of ->layers safe in rcu pathwalk
  ovl: fetch inode once in ovl_dentry_revalidate_common()
  ovl: move freeing ovl_entry past rcu delay
  ovl: fix file reference leak when submitting aio

Merge branch 'mptcp-fixes-and-maintainer-email-update-for-v6-6' · c29d9845

Jakub Kicinski authored Oct 05, 2023

Mat Martineau says:

====================
mptcp: Fixes and maintainer email update for v6.6

Patch 1 addresses a race condition in MPTCP "delegated actions"
infrastructure. Affects v5.19 and later.

Patch 2 removes an unnecessary restriction that did not allow additional
outgoing subflows using the local address of the initial MPTCP subflow.
v5.16 and later.

Patch 3 updates Matthieu's email address.
====================

Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-0-28de4ac663ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

MAINTAINERS: update Matthieu's email address · 8eed6ee3

Matthieu Baerts authored Oct 04, 2023

Use my kernel.org account instead.

The other one will bounce by the end of the year.

Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-3-28de4ac663ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: userspace pm allow creating id 0 subflow · e5ed101a

Geliang Tang authored Oct 04, 2023

This patch drops id 0 limitation in mptcp_nl_cmd_sf_create() to allow
creating additional subflows with the local addr ID 0.

There is no reason not to allow additional subflows from this local
address: we should be able to create new subflows from the initial
endpoint. This limitation was breaking fullmesh support from userspace.

Fixes: 702c2f64 ("mptcp: netlink: allow userspace-driven subflow establishment")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/391
Cc: stable@vger.kernel.org
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-2-28de4ac663ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: fix delegated action races · a5efdbce

Paolo Abeni authored Oct 04, 2023

The delegated action infrastructure is prone to the following
race: different CPUs can try to schedule different delegated
actions on the same subflow at the same time.

Each of them will check different bits via mptcp_subflow_delegate(),
and will try to schedule the action on the related per-cpu napi
instance.

Depending on the timing, both can observe an empty delegated list
node, causing the same entry to be added simultaneously on two different
lists.

The root cause is that the delegated actions infra does not provide
a single synchronization point. Address the issue by reserving an
additional bit to mark the subflow as scheduled for delegation.
Acquiring that bit guarantees that the caller owns the delegated list
node and can safely schedule the subflow.

Clear that bit only when the subflow scheduling is completed, ensuring
the proper barrier is in place.

Additionally, swap the meaning of the delegated_action bitmask to allow
the usage of the existing helper to set multiple bits at once.
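
The single synchronization point described above can be sketched in user space with C11 atomics. This is an illustration under stated assumptions, not the kernel code: the bit names, `struct subflow`, `subflow_delegate()` and `subflow_delegated_done()` are hypothetical, and `times_queued` merely stands in for adding the node to a per-CPU list.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit layout; the names are illustrative only. */
#define DELEGATE_SCHEDULED  (1u << 0)  /* ownership of the list node */
#define DELEGATE_ACTION_A   (1u << 1)
#define DELEGATE_ACTION_B   (1u << 2)

struct subflow {
	atomic_uint status;
	int times_queued;   /* stands in for adding the node to a list */
};

/*
 * Record the requested action, then try to take the SCHEDULED bit.
 * Only the caller that observes the bit as previously clear owns the
 * list node and queues it; everyone else just leaves the action bit set.
 */
static bool subflow_delegate(struct subflow *sf, unsigned int action)
{
	unsigned int old;

	atomic_fetch_or(&sf->status, action);
	old = atomic_fetch_or(&sf->status, DELEGATE_SCHEDULED);
	if (old & DELEGATE_SCHEDULED)
		return false;

	sf->times_queued++;
	return true;
}

/*
 * Runs when the scheduled work has completed: collect the pending
 * actions and release ownership in one atomic step.
 */
static unsigned int subflow_delegated_done(struct subflow *sf)
{
	unsigned int old = atomic_exchange(&sf->status, 0);

	return old & ~DELEGATE_SCHEDULED;
}
```

Two callers requesting different actions on the same subflow then queue the node once, and both actions are seen when the work runs.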

Fixes: bcd97734 ("mptcp: use delegate action to schedule 3rd ack retrans")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-1-28de4ac663ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: remove unneeded stmmac_poll_controller · 3eef8555

Remi Pommarel authored Oct 04, 2023

When using netconsole, netpoll_poll_dev() can be called from interrupt
context, so calling disable_irq() there causes the following kernel
warning with CONFIG_DEBUG_ATOMIC_SLEEP enabled:

  BUG: sleeping function called from invalid context at kernel/irq/manage.c:137
  in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 10, name: ksoftirqd/0
  CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G        W         5.15.42-00075-g816b502b2298-dirty #117
  Hardware name: aml (r1) (DT)
  Call trace:
   dump_backtrace+0x0/0x270
   show_stack+0x14/0x20
   dump_stack_lvl+0x8c/0xac
   dump_stack+0x18/0x30
   ___might_sleep+0x150/0x194
   __might_sleep+0x64/0xbc
   synchronize_irq+0x8c/0x150
   disable_irq+0x2c/0x40
   stmmac_poll_controller+0x140/0x1a0
   netpoll_poll_dev+0x6c/0x220
   netpoll_send_skb+0x308/0x390
   netpoll_send_udp+0x418/0x760
   write_msg+0x118/0x140 [netconsole]
   console_unlock+0x404/0x500
   vprintk_emit+0x118/0x250
   dev_vprintk_emit+0x19c/0x1cc
   dev_printk_emit+0x90/0xa8
   __dev_printk+0x78/0x9c
   _dev_warn+0xa4/0xbc
   ath10k_warn+0xe8/0xf0 [ath10k_core]
   ath10k_htt_txrx_compl_task+0x790/0x7fc [ath10k_core]
   ath10k_pci_napi_poll+0x98/0x1f4 [ath10k_pci]
   __napi_poll+0x58/0x1f4
   net_rx_action+0x504/0x590
   _stext+0x1b8/0x418
   run_ksoftirqd+0x74/0xa4
   smpboot_thread_fn+0x210/0x3c0
   kthread+0x1fc/0x210
   ret_from_fork+0x10/0x20

Since [0], .ndo_poll_controller is only needed if the driver does not
use NAPI or uses it only partially. Because stmmac uses NAPI,
stmmac_poll_controller can be removed, which fixes the above warning.

[0] commit ac3d9dd0 ("netpoll: make ndo_poll_controller() optional")

Cc: <stable@vger.kernel.org> # 5.15.x
Fixes: 47dd7a54 ("net: add support for STMicroelectronics Ethernet controllers")
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/1c156a6d8c9170bd6a17825f2277115525b4d50f.1696429960.git.repk@triplefau.lt
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: lan743x: also select PHYLIB · 566aeed6

Randy Dunlap authored Oct 02, 2023

Since FIXED_PHY depends on PHYLIB, PHYLIB needs to be set to avoid
a kconfig warning:

WARNING: unmet direct dependencies detected for FIXED_PHY
  Depends on [n]: NETDEVICES [=y] && PHYLIB [=n]
  Selected by [y]:
  - LAN743X [=y] && NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MICROCHIP [=y] && PCI [=y] && PTP_1588_CLOCK_OPTIONAL [=y]

Fixes: 73c4d1b3 ("net: lan743x: select FIXED_PHY")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: lore.kernel.org/r/202309261802.JPbRHwti-lkp@intel.com
Cc: Bryan Whitehead <bryan.whitehead@microchip.com>
Cc: UNGLinuxDriver@microchip.com
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Link: https://lore.kernel.org/r/20231002193544.14529-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: mediatek: disable irq before schedule napi · fcdfc462

Christian Marangi authored Oct 02, 2023

While searching for a possible refactor of napi_schedule_prep and
__napi_schedule, it was noticed that the mtk eth driver disables the
interrupt for rx and tx AFTER napi is scheduled.

While this case is very hard to reproduce, the interrupt may end up
disabled and never enabled again, because the napi can complete and
re-enable the interrupt before the handler disables it.

This happens because a napi driven by interrupt expects the following
logic:
1. interrupt received. napi prepared -> interrupt disabled -> napi
   scheduled
2. napi triggered. ring cleared -> interrupt enabled -> wait for new
   interrupt

To prevent this case, disable the interrupt BEFORE the napi is
scheduled.
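
The ordering problem can be modelled in a few lines of user-space C. This is only a sketch: `struct nic`, `napi_poll_complete()` and the two `isr_*` helpers are hypothetical names, and the `poll_wins_race` flag stands in for the poll finishing on another CPU in the window between the two steps of the handler.

```c
#include <stdbool.h>

/* Hypothetical device state; not the mtk_eth_soc driver's structures. */
struct nic {
	bool irq_enabled;
	bool napi_scheduled;
};

/* Models a poll that drains the ring, completes and re-enables the IRQ. */
static void napi_poll_complete(struct nic *n)
{
	n->napi_scheduled = false;
	n->irq_enabled = true;
}

/*
 * Problematic order: schedule first, disable afterwards. If the poll
 * completes inside that window, the late disable is never undone.
 */
static bool isr_schedule_then_disable(struct nic *n, bool poll_wins_race)
{
	n->napi_scheduled = true;
	if (poll_wins_race)
		napi_poll_complete(n);
	n->irq_enabled = false;
	if (!poll_wins_race)
		napi_poll_complete(n);
	return n->irq_enabled;
}

/*
 * Fixed order: disable first, then schedule. The poll cannot run before
 * it is scheduled, so its re-enable always comes last.
 */
static bool isr_disable_then_schedule(struct nic *n)
{
	n->irq_enabled = false;
	n->napi_scheduled = true;
	napi_poll_complete(n);
	return n->irq_enabled;
}
```

Only the first helper with `poll_wins_race` set leaves the interrupt disabled at the end.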

Fixes: 656e7052 ("net-next: mediatek: add support for MT7623 ethernet")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Link: https://lore.kernel.org/r/20231002140805.568-1-ansuelsmth@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-mana-fix-some-tx-processing-bugs' · defe4b87

Paolo Abeni authored Oct 05, 2023

Haiyang Zhang says:

====================
net: mana: Fix some TX processing bugs

Fix TX processing bugs on error handling, tso_bytes calculation,
and sge0 size.
====================

Link: https://lore.kernel.org/r/1696020147-14989-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: mana: Fix oversized sge0 for GSO packets · a43e8e9f

Haiyang Zhang authored Sep 29, 2023

Handle the case when GSO SKB linear length is too large.

MANA NIC requires GSO packets to put only the header part to SGE0,
otherwise the TX queue may stop at the HW level.

So, use 2 SGEs for the skb linear part which contains more than the
packet header.
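
A rough sketch of the split described above, assuming the header length of the GSO packet is known. `struct sge` and `map_linear()` are hypothetical and only show how the linear part could be described by one or two scatter-gather entries; they are not the driver's data structures.

```c
#include <stddef.h>

/* Hypothetical scatter-gather entry: a range inside the skb linear part. */
struct sge {
	size_t offset;
	size_t len;
};

/*
 * Describe the linear part of a packet. For a GSO packet whose linear
 * part holds more than the header, SGE0 carries only the header and
 * SGE1 carries the rest. Returns the number of entries written.
 */
static int map_linear(int is_gso, size_t hdr_len, size_t linear_len,
		      struct sge out[2])
{
	if (is_gso && linear_len > hdr_len) {
		out[0] = (struct sge){ 0, hdr_len };
		out[1] = (struct sge){ hdr_len, linear_len - hdr_len };
		return 2;
	}

	out[0] = (struct sge){ 0, linear_len };
	return 1;
}
```

For example, a GSO packet with a 54-byte header and a 200-byte linear part yields entries of 54 and 146 bytes.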

Fixes: ca9c54d2 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: mana: Fix the tso_bytes calculation · 7a54de92

Haiyang Zhang authored Sep 29, 2023

sizeof(struct hop_jumbo_hdr) is not part of tso_bytes, so remove
the subtraction from header size.

Cc: stable@vger.kernel.org
Fixes: bd7fc6e1 ("net: mana: Add new MANA VF performance counters for easier troubleshooting")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: mana: Fix TX CQE error handling · b2b00006

Haiyang Zhang authored Sep 29, 2023

For an unknown TX CQE error type (probably from newer hardware),
still free the SKB, update the queue tail, etc., otherwise the
accounting will be wrong.

Also, TX errors can be triggered by injecting corrupted packets, so
replace the WARN_ONCE with ratelimited error logging.

Cc: stable@vger.kernel.org
Fixes: ca9c54d2 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'rtla-v6.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux · 3006adf3

Linus Torvalds authored Oct 04, 2023

Pull rtla fixes from Daniel Bristot de Oliveira:
 "rtla (Real-Time Linux Analysis) tool fixes.

  Timerlat auto-analysis:

   - Timerlat is reporting thread interference time without thread noise
     events occurrence. It was caused because the thread interference
     variable was not reset after the analysis of a timerlat activation
     that did not hit the threshold.

   - The IRQ handler delay is estimated from the delta of the IRQ
     latency reported by timerlat, and the timestamp from IRQ handler
     start event. If the delta is near-zero, the drift from the external
     clock and the trace event and/or the overhead can cause the value
     to be negative. If the value is negative, print a zero-delay.

   - IRQ handlers happening after the timerlat thread event but before
     the stop tracing were being reported as IRQ that happened before
     the *current* IRQ occurrence. Ignore Previous IRQ noise in this
     condition because they are valid only for the *next* timerlat
     activation.

  Timerlat user-space:

   - Timerlat is stopping all user-space thread if a CPU becomes
     offline. Do not stop the entire tool if a CPU is/become offline,
     but only the thread of the unavailable CPU. Stop the tool only, if
     all threads leave because the CPUs become/are offline.

  man-pages:

   - Fix command line example in timerlat hist man page"

* tag 'rtla-v6.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux:
  rtla: fix a example in rtla-timerlat-hist.rst
  rtla/timerlat: Do not stop user-space if a cpu is offline
  rtla/timerlat_aa: Fix previous IRQ delay for IRQs that happens after thread sample
  rtla/timerlat_aa: Fix negative IRQ delay
  rtla/timerlat_aa: Zero thread sum after every sample analysis

netlink: annotate data-races around sk->sk_err · d0f95894

Eric Dumazet authored Oct 03, 2023

syzbot caught another data-race in netlink when
setting sk->sk_err.

Annotate all of them for good measure.

BUG: KCSAN: data-race in netlink_recvmsg / netlink_recvmsg

write to 0xffff8881613bb220 of 4 bytes by task 28147 on cpu 0:
netlink_recvmsg+0x448/0x780 net/netlink/af_netlink.c:1994
sock_recvmsg_nosec net/socket.c:1027 [inline]
sock_recvmsg net/socket.c:1049 [inline]
__sys_recvfrom+0x1f4/0x2e0 net/socket.c:2229
__do_sys_recvfrom net/socket.c:2247 [inline]
__se_sys_recvfrom net/socket.c:2243 [inline]
__x64_sys_recvfrom+0x78/0x90 net/socket.c:2243
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

write to 0xffff8881613bb220 of 4 bytes by task 28146 on cpu 1:
netlink_recvmsg+0x448/0x780 net/netlink/af_netlink.c:1994
sock_recvmsg_nosec net/socket.c:1027 [inline]
sock_recvmsg net/socket.c:1049 [inline]
__sys_recvfrom+0x1f4/0x2e0 net/socket.c:2229
__do_sys_recvfrom net/socket.c:2247 [inline]
__se_sys_recvfrom net/socket.c:2243 [inline]
__x64_sys_recvfrom+0x78/0x90 net/socket.c:2243
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x00000000 -> 0x00000016

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 28146 Comm: syz-executor.0 Not tainted 6.6.0-rc3-syzkaller-00055-g9ed22ae6 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023
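
For readers unfamiliar with the annotations, here is a user-space approximation. The two macros below are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() and rely on the GCC/Clang `__typeof__` extension; `struct fake_sock`, `report_error()` and `peek_error()` are hypothetical. Marking both sides of a lockless access this way documents that the race is intentional and keeps the compiler from splitting or merging the access.

```c
/* Simplified user-space stand-ins for the kernel macros (GCC/Clang only). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical socket with just the field of interest. */
struct fake_sock {
	int sk_err;
};

/* Writer side: store the error with a marked access. */
static void report_error(struct fake_sock *sk, int err)
{
	WRITE_ONCE(sk->sk_err, err);
}

/* Reader side: load the error with a marked access. */
static int peek_error(struct fake_sock *sk)
{
	return READ_ONCE(sk->sk_err);
}
```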

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231003183455.3410550-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: update hb timer immediately after users change hb_interval · 1f4e803c

Xin Long authored Oct 01, 2023

Currently, when hb_interval is changed by users, it won't take effect
until the next expiry of hb timer. As the default value is 30s, users
have to wait up to 30s for their hb_interval update to take effect.

This becomes pretty bad in containers where a much smaller value is
usually set on hb_interval. This patch improves it by resetting the
hb timer immediately once the value of hb_interval is updated by users.
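
The difference between the two behaviours can be sketched as follows. `struct transport` and both setters are hypothetical, times are plain millisecond counters, and the RTO and jitter that a real heartbeat timeout would include are ignored for brevity.

```c
/* Hypothetical transport with a single heartbeat timer (times in ms). */
struct transport {
	unsigned long hb_interval;
	unsigned long timer_expires;   /* absolute expiry time */
};

/* Old behaviour: only store the value; the armed timer keeps its expiry. */
static void set_hb_interval_lazy(struct transport *t, unsigned long interval)
{
	t->hb_interval = interval;
}

/* New behaviour: store the value and re-arm the timer relative to now. */
static void set_hb_interval_reset(struct transport *t, unsigned long now,
				  unsigned long interval)
{
	t->hb_interval = interval;
	t->timer_expires = now + interval;
}
```

With a timer armed for t = 30000, switching to a 1000 ms interval at t = 500 leaves the expiry at 30000 in the first case and moves it to 1500 in the second.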

Note that we don't address the already existing 'problem' when sending
a heartbeat 'on demand' if one hb has just been sent (from the timer),
mentioned in:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg590224.html

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/r/75465785f8ee5df2fb3acdca9b8fafdc18984098.1696172660.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: update transport state when processing a dupcook packet · 2222a780

Xin Long authored Oct 01, 2023

During the 4-way handshake, the transport's state is set to ACTIVE in
sctp_process_init() when processing INIT_ACK chunk on client or
COOKIE_ECHO chunk on server.

In the collision scenario below:

192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021]

when processing COOKIE_ECHO on 192.168.1.2, as it's in COOKIE_WAIT state,
sctp_sf_do_dupcook_b() is called by sctp_sf_do_5_2_4_dupcook() where it
creates a new association and sets its transport to ACTIVE then updates
to the old association in sctp_assoc_update().

However, in sctp_assoc_update(), it will skip the transport update if it
finds a transport with the same ipaddr already existing in the old asoc,
and this causes the old asoc's transport state not to move to ACTIVE
after the handshake.

This means if DATA retransmission happens at this moment, it won't be able
to enter PF state because of the check 'transport->state == SCTP_ACTIVE'
in sctp_do_8_2_transport_strike().

This patch fixes it by updating the transport in sctp_assoc_update() with
sctp_assoc_add_peer(), which updates the transport state if a transport
with the same ipaddr already exists in the old asoc.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/r/fd17356abe49713ded425250cc1ae51e9f5846c6.1696172325.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

04 Oct, 2023 17 commits

tcp: fix delayed ACKs for MSS boundary condition · 4720852e

Neal Cardwell authored Oct 01, 2023

This commit fixes poor delayed ACK behavior that can cause poor TCP
latency in a particular boundary condition: when an application makes
a TCP socket write that is an exact multiple of the MSS size.

The problem is that there is painful boundary discontinuity in the
current delayed ACK behavior. With the current delayed ACK behavior,
we have:

(1) If an app reads data when > 1*MSS is unacknowledged, then
    tcp_cleanup_rbuf() ACKs immediately because of:

     tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||

(2) If an app reads all received data, and the packets were < 1*MSS,
    and either (a) the app is not ping-pong or (b) we received two
    packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately because
    of:

     ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
      ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
       !inet_csk_in_pingpong_mode(sk))) &&

(3) *However*: if an app reads exactly 1*MSS of data,
    tcp_cleanup_rbuf() does not send an immediate ACK. This is true
    even if the app is not ping-pong and the 1*MSS of data had the PSH
    bit set, suggesting the sending application completed an
    application write.

Thus if the app is not ping-pong, we have this painful case where
>1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
write whose last skb is an exact multiple of 1*MSS can get a 40ms
delayed ACK. This means that any app that transfers data in one
direction and takes care to align write size or packet size with MSS
can suffer this problem. With receive zero copy making 4KB MSS values
more common, it is becoming more common to have application writes
naturally align with MSS, and more applications are likely to
encounter this delayed ACK problem.

The fix in this commit is to refine the delayed ACK heuristics with a
simple check: immediately ACK a received 1*MSS skb with PSH bit set if
the app reads all data. Why? If an skb has a len of exactly 1*MSS and
has the PSH bit set then it is likely the end of an application
write. So more data may not be arriving soon, and yet the data sender
may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
an ACK immediately if the app reads all of the data and is not
ping-pong. Note that this logic is also executed for the case where
len > MSS, but in that case this logic does not matter (and does not
hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
app reads data and there is more than an MSS of unACKed data.
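
The three cases and the fix can be modelled in user space. This is a simplified sketch, not the kernel code: `struct rx_state`, `on_data()` and `ack_now_after_read()` are hypothetical, the flag values are arbitrary, and the model assumes the application has read all received data when the decision is made.

```c
#include <stdbool.h>

/* Arbitrary flag values for the model. */
#define ACK_PUSHED   0x1
#define ACK_PUSHED2  0x2

/* Hypothetical receive-side state. */
struct rx_state {
	unsigned int rcv_mss;
	unsigned int unacked;    /* models rcv_nxt - rcv_wup */
	unsigned int pending;
	bool pingpong;
};

/* Bookkeeping when one skb of 'len' bytes arrives. */
static void on_data(struct rx_state *s, unsigned int len, bool psh,
		    bool with_fix)
{
	s->unacked += len;
	if (len < s->rcv_mss) {
		/* Sub-MSS packet: PUSHED, or PUSHED2 on the second one. */
		if (s->pending & ACK_PUSHED)
			s->pending |= ACK_PUSHED2;
		s->pending |= ACK_PUSHED;
	} else if (with_fix && psh) {
		/* The fix: a full-sized skb with PSH likely ends a write. */
		s->pending |= ACK_PUSHED;
	}
}

/* Decision taken after the app has read all received data. */
static bool ack_now_after_read(const struct rx_state *s)
{
	if (s->unacked > s->rcv_mss)
		return true;                          /* case (1) */
	if ((s->pending & ACK_PUSHED2) ||
	    ((s->pending & ACK_PUSHED) && !s->pingpong))
		return true;                          /* case (2) */
	return false;                                 /* case (3) */
}
```

Without the fix, a single 1*MSS skb with PSH falls through to case (3); with it, the same skb triggers an immediate ACK unless the socket is in ping-pong mode.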

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Guo <guoxin0309@gmail.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: fix quick-ack counting to count actual ACKs of new data · 059217c1

Neal Cardwell authored Oct 01, 2023

This commit fixes quick-ack counting so that it only considers that a
quick-ack has been provided if we are sending an ACK that newly
acknowledges data.

The code was erroneously using the number of data segments in outgoing
skbs when deciding how many quick-ack credits to remove. This logic
does not make sense, and could cause poor performance in
request-response workloads, like RPC traffic, where requests or
responses can be multi-segment skbs.

When a TCP connection decides to send N quick-acks, that is to
accelerate the cwnd growth of the congestion control module
controlling the remote endpoint of the TCP connection. That quick-ack
decision is purely about the incoming data and outgoing ACKs. It has
nothing to do with the outgoing data or the size of outgoing data.

And in particular, an ACK only serves the intended purpose of allowing
the remote congestion control to grow the congestion window quickly if
the ACK is ACKing or SACKing new data.

The fix is simple: only count packets as serving the goal of the
quickack mechanism if they are ACKing/SACKing new data. We can tell
whether this is the case by checking inet_csk_ack_scheduled(), since
we schedule an ACK exactly when we are ACKing/SACKing new data.
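
A small model of the accounting change, assuming a per-connection credit counter. `struct ack_state` and both helpers are hypothetical; `ack_scheduled` stands in for the result of inet_csk_ack_scheduled().

```c
#include <stdbool.h>

/* Hypothetical per-connection state. */
struct ack_state {
	unsigned int quick;    /* remaining quick-ack credits */
	bool ack_scheduled;    /* an ACK of new data is pending */
};

/* Old accounting: charge one credit per data segment in the outgoing skb. */
static void dec_quickack_old(struct ack_state *s, unsigned int segs)
{
	if (!s->quick)
		return;
	if (segs >= s->quick)
		s->quick = 0;
	else
		s->quick -= segs;
}

/* New accounting: charge one credit only if this packet ACKs new data. */
static void dec_quickack_new(struct ack_state *s)
{
	unsigned int pkts = s->ack_scheduled ? 1 : 0;

	if (!s->quick)
		return;
	if (pkts >= s->quick)
		s->quick = 0;
	else
		s->quick -= pkts;
}
```

Sending a 10-segment response with 16 credits left drops the count to 6 under the old accounting, while the new accounting leaves it at 16 when no ACK of new data is scheduled and at 15 when one is.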

Fixes: fc6415bc ("[TCP]: Fix quick-ack decrementing with TSO.")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · c56e67f3

Jakub Kicinski authored Oct 04, 2023

Florian Westphal says:

====================
netfilter patches for net

First patch resolves a regression with vlan header matching, this was
broken since 6.5 release.  From myself.

Second patch fixes an ancient problem with sctp connection tracking in
case INIT_ACK packets are delayed.  This comes with a selftest, both
patches from Xin Long.

Patch 4 extends the existing nftables audit selftest, from
Phil Sutter.

Patch 5, also from Phil, avoids a situation where nftables
would emit an audit record twice. This was broken since 5.13 days.

Patch 6, from myself, avoids spurious insertion failure if we encounter an
overlapping but expired range during element insertion with the
'nft_set_rbtree' backend. This problem exists since 6.2.

* tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
  netfilter: nf_tables: Deduplicate nft_register_obj audit logs
  selftests: netfilter: Extend nft_audit.sh
  selftests: netfilter: test for sctp collision processing in nf_conntrack
  netfilter: handle the connecting collision properly in nf_conntrack_proto_sctp
  netfilter: nft_payload: rebuild vlan header on h_proto access
====================

Link: https://lore.kernel.org/r/20231004141405.28749-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c56e67f3

page_pool: fix documentation typos · 513dbc10

Randy Dunlap authored Sep 30, 2023

Correct grammar for better readability.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20231001003846.29541-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

513dbc10

tipc: fix a potential deadlock on &tx->lock · 08e50cf0

Chengfeng Ye authored Sep 27, 2023

It seems that tipc_crypto_key_revoke() could be invoked by the
workqueue tipc_crypto_work_rx() under process context and by the
timer/rx callback under softirq context, so the lock acquisition
on &tx->lock should use spin_lock_bh() to prevent a possible
deadlock.

This flaw was found by an experimental static analysis tool I am
developing for irq-related deadlock.

tipc_crypto_work_rx() <workqueue>
--> tipc_crypto_key_distr()
--> tipc_bcast_xmit()
--> tipc_bcbase_xmit()
--> tipc_bearer_bc_xmit()
--> tipc_crypto_xmit()
--> tipc_ehdr_build()
--> tipc_crypto_key_revoke()
--> spin_lock(&tx->lock)
<timer interrupt>
   --> tipc_disc_timeout()
   --> tipc_bearer_xmit_skb()
   --> tipc_crypto_xmit()
   --> tipc_ehdr_build()
   --> tipc_crypto_key_revoke()
   --> spin_lock(&tx->lock) <deadlock here>
Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Fixes: fc1b6d6d ("tipc: introduce TIPC encryption & authentication")
Link: https://lore.kernel.org/r/20230927181414.59928-1-dg573847474@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

08e50cf0

net: stmmac: dwmac-stm32: fix resume on STM32 MCU · 6f195d6b

Ben Wolsieffer authored Sep 27, 2023

The STM32MP1 keeps clk_rx enabled during suspend, and therefore the
driver does not enable the clock in stm32_dwmac_init() if the device was
suspended. The problem is that this same code runs on STM32 MCUs, which
do disable clk_rx during suspend, causing the clock to never be
re-enabled on resume.

This patch adds a variant flag to indicate that clk_rx remains enabled
during suspend, and uses this to decide whether to enable the clock in
stm32_dwmac_init() if the device was suspended.
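
The decision described above can be sketched as a small user-space model. The names variant_model, clk_rx_enable_in_suspend, and should_enable_clk_rx() are chosen here for illustration and are assumptions, not a quote of the driver code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-variant description; the flag name is an assumption. */
struct variant_model {
	bool clk_rx_enable_in_suspend;	/* clk_rx keeps running in suspend */
};

/*
 * Decide whether the init path should enable clk_rx.
 * The enable is skipped only when resuming on a variant that kept
 * the clock running across suspend.
 */
bool should_enable_clk_rx(const struct variant_model *v, bool was_suspended)
{
	assert(v);
	return !(was_suspended && v->clk_rx_enable_in_suspend);
}
```

With the flag set (the STM32MP1 case in the text), a resume skips the enable; with the flag clear (the STM32 MCU case), the clock is enabled again on resume.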

This approach fixes this specific bug with limited opportunity for
unintended side-effects, but I have a follow up patch that will refactor
the clock configuration and hopefully make it less error prone.

Fixes: 6528e02c ("net: ethernet: stmmac: add adaptation for stm32mp157c.")
Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230927175749.1419774-1-ben.wolsieffer@hefring.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6f195d6b

ipv4: Set offload_failed flag in fibmatch results · 0add5c59

Benjamin Poirier authored Sep 26, 2023

Due to a small omission, the offload_failed flag is missing from ipv4
fibmatch results. Make sure it is set correctly.

The issue can be witnessed using the following commands:
echo "1 1" > /sys/bus/netdevsim/new_device
ip link add dummy1 up type dummy
ip route add 192.0.2.0/24 dev dummy1
echo 1 > /sys/kernel/debug/netdevsim/netdevsim1/fib/fail_route_offload
ip route add 198.51.100.0/24 dev dummy1
ip route
	# 192.168.15.0/24 has rt_trap
	# 198.51.100.0/24 has rt_offload_failed
ip route get 192.168.15.1 fibmatch
	# Result has rt_trap
ip route get 198.51.100.1 fibmatch
	# Result differs from the route shown by `ip route`, it is missing
	# rt_offload_failed
ip link del dev dummy1
echo 1 > /sys/bus/netdevsim/del_device

Fixes: 36c5100e ("IPv4: Add "offload failed" indication to routes")
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230926182730.231208-1-bpoirier@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0add5c59

Merge tag 'linux-kselftest-fixes-6.6-rc5' of... · ba7d997a

Linus Torvalds authored Oct 04, 2023

Merge tag 'linux-kselftest-fixes-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest fix from Shuah Khan:
 "One single fix to Makefile to fix the incorrect TARGET name for uevent
  test"

* tag 'linux-kselftest-fixes-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests: Fix wrong TARGET in kselftest top level Makefile

ba7d997a

Merge tag 'wireless-2023-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless · 72897b29

Jakub Kicinski authored Oct 04, 2023

Johannes Berg says:

====================

Quite a collection of fixes this time, really too many
to list individually. Many stack fixes, even rfkill
(found by simulation and the new eevdf scheduler)!

Also a bigger maintainers file cleanup, to remove old
and redundant information.

* tag 'wireless-2023-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (32 commits)
  wifi: iwlwifi: mvm: Fix incorrect usage of scan API
  wifi: mac80211: Create resources for disabled links
  wifi: cfg80211: avoid leaking stack data into trace
  wifi: mac80211: allow transmitting EAPOL frames with tainted key
  wifi: mac80211: work around Cisco AP 9115 VHT MPDU length
  wifi: cfg80211: Fix 6GHz scan configuration
  wifi: mac80211: fix potential key leak
  wifi: mac80211: fix potential key use-after-free
  wifi: mt76: mt76x02: fix MT76x0 external LNA gain handling
  wifi: brcmfmac: Replace 1-element arrays with flexible arrays
  wifi: mwifiex: Fix oob check condition in mwifiex_process_rx_packet
  wifi: rtw88: rtw8723d: Fix MAC address offset in EEPROM
  rfkill: sync before userspace visibility/changes
  wifi: mac80211: fix mesh id corruption on 32 bit systems
  wifi: cfg80211: add missing kernel-doc for cqm_rssi_work
  wifi: cfg80211: fix cqm_config access race
  wifi: iwlwifi: mvm: Fix a memory corruption issue
  wifi: iwlwifi: Ensure ack flag is properly cleared.
  wifi: iwlwifi: dbg_ini: fix structure packing
  iwlwifi: mvm: handle PS changes in vif_cfg_changed
  ...
====================

Link: https://lore.kernel.org/r/20230927095835.25803-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

72897b29

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 1eb3dee1

Jakub Kicinski authored Oct 04, 2023

Daniel Borkmann says:

====================
pull-request: bpf 2023-10-02

We've added 11 non-merge commits during the last 12 day(s) which contain
a total of 12 files changed, 176 insertions(+), 41 deletions(-).

The main changes are:

1) Fix BPF verifier to reset backtrack_state masks on global function
   exit as otherwise subsequent precision tracking would reuse them,
   from Andrii Nakryiko.

2) Several sockmap fixes for available bytes accounting,
   from John Fastabend.

3) Reject sk_msg egress redirects to non-TCP sockets given this
   is only supported for TCP sockets today, from Jakub Sitnicki.

4) Fix a syzkaller splat in bpf_mprog when hitting maximum program
   limits with BPF_F_BEFORE directive, from Daniel Borkmann
   and Nikolay Aleksandrov.

5) Fix BPF memory allocator to use kmalloc_size_roundup() to adjust
   size_index for selecting a bpf_mem_cache, from Hou Tao.

6) Fix arch_prepare_bpf_trampoline return code for s390 JIT,
   from Song Liu.

7) Fix bpf_trampoline_get when CONFIG_BPF_JIT is turned off,
   from Leon Hwang.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Use kmalloc_size_roundup() to adjust size_index
  selftest/bpf: Add various selftests for program limits
  bpf, mprog: Fix maximum program check on mprog attachment
  bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
  bpf, sockmap: Add tests for MSG_F_PEEK
  bpf, sockmap: Do not inc copied_seq when PEEK flag set
  bpf: tcp_read_skb needs to pop skb regardless of seq
  bpf: unconditionally reset backtrack_state masks on global func exit
  bpf: Fix tr dereferencing
  selftests/bpf: Check bpf_cubic_acked() is called via struct_ops
  s390/bpf: Let arch_prepare_bpf_trampoline return program size
====================

Link: https://lore.kernel.org/r/20231002113417.2309-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1eb3dee1

netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure · 08738827

Florian Westphal authored Sep 28, 2023

nft_rbtree_gc_elem() walks back and removes the end interval element that
comes before the expired element.

There is a small chance that we've cached this element as 'rbe_ge'.
If this happens, we hold and test a pointer that has been queued for
freeing.

It also causes spurious insertion failures:

$ cat test-testcases-sets-0044interval_overlap_0.1/testout.log
Error: Could not process rule: File exists
add element t s {  0 -  2 }
                   ^^^^^^
Failed to insert  0 -  2 given:
table ip t {
        set s {
                type inet_service
                flags interval,timeout
                timeout 2s
                gc-interval 2s
        }
}

The set (rbtree) is empty. The 'failure' doesn't happen on next attempt.

The reason is that when we try to insert, the tree may hold an expired
element that collides with the range we're adding.
While we do evict/erase this element, we can trip over this check:

if (rbe_ge && nft_rbtree_interval_end(rbe_ge) && nft_rbtree_interval_end(new))
      return -ENOTEMPTY;

rbe_ge was erased by the synchronous gc, so we should not have done this
check.  The next attempt won't find it, so a retry results in successful
insertion.

Restart in-kernel to avoid such spurious errors.

Such restarts are rare, unless userspace intentionally adds very large
numbers of elements with very short timeouts while setting a huge
gc interval.

Even in this case, this cannot loop forever: on each retry an existing
element has been removed.

As the caller is holding the transaction mutex, it's impossible
for a second entity to add more expiring elements to the tree.
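
The restart idea can be illustrated with a small user-space model. This is only a sketch under simplifying assumptions: the set is a fixed array instead of an rbtree, and model_set, model_elem, model_insert_once(), and model_insert() are hypothetical names rather than the kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX 8

struct model_elem {
	int start, end;
	bool expired;
	bool used;
};

struct model_set {
	struct model_elem e[MODEL_MAX];
};

/*
 * One insertion pass. If an expired element overlaps the new range,
 * evict it and ask the caller to restart instead of continuing with
 * state that may refer to the evicted element.
 */
static int model_insert_once(struct model_set *s, int start, int end)
{
	size_t i, free_slot = MODEL_MAX;

	for (i = 0; i < MODEL_MAX; i++) {
		struct model_elem *el = &s->e[i];

		if (!el->used) {
			if (free_slot == MODEL_MAX)
				free_slot = i;
			continue;
		}
		if (el->end < start || el->start > end)
			continue;		/* no overlap */
		if (el->expired) {
			el->used = false;	/* synchronous gc */
			return -EAGAIN;		/* restart the walk */
		}
		return -EEXIST;			/* live overlap */
	}
	if (free_slot == MODEL_MAX)
		return -ENOSPC;

	s->e[free_slot] = (struct model_elem){ start, end, false, true };
	return 0;
}

/* Retry internally so the caller never sees the spurious failure. */
int model_insert(struct model_set *s, int start, int end)
{
	int err;

	assert(s);
	do {
		err = model_insert_once(s, start, end);
	} while (err == -EAGAIN);

	return err;
}
```

Each -EAGAIN in this model removes one element, so the loop runs at most MODEL_MAX extra passes, mirroring the termination argument above.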

After this it also becomes feasible to remove the async gc worker
and perform all garbage collection from the commit path.

Fixes: c9e6978e ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection")
Signed-off-by: Florian Westphal <fw@strlen.de>

08738827

netfilter: nf_tables: Deduplicate nft_register_obj audit logs · 0d880dc6

Phil Sutter authored Sep 23, 2023

When adding/updating an object, the transaction handler already emits
suitable audit log entries, so the one in nft_obj_notify() is redundant.
To fix that (and retain the audit logging from objects' 'update'
callback), introduce an "audit log free" variant for internal use.

Fixes: c520292f ("audit: log nftables configuration change events once per table")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Paul Moore <paul@paul-moore.com> (Audit)
Signed-off-by: Florian Westphal <fw@strlen.de>

0d880dc6

selftests: netfilter: Extend nft_audit.sh · 203bb9d3

Phil Sutter authored Sep 23, 2023

Add tests for sets and elements and deletion of all kinds. Also
reorder rule reset tests: By moving the bulk rule add command up, the
two 'reset rules' tests become identical.

While at it, fix a failing bulk rule add test's error status getting
lost due to its use in a pipe. Avoid this by using a temporary file.

Headings in diff output for failing tests contain no useful data, so
strip them.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Florian Westphal <fw@strlen.de>

203bb9d3

selftests: netfilter: test for sctp collision processing in nf_conntrack · cf791b22

Xin Long authored Oct 03, 2023

This patch adds a test case to reproduce the SCTP DATA chunk retransmission
timeout issue caused by the improper SCTP collision processing in netfilter
nf_conntrack_proto_sctp.

In this test, the client sends an INIT chunk, but the INIT_ACK replied
from the server is delayed until the server sends an INIT chunk to start
a new connection from its side. After the connection is complete from the
server side, the delayed INIT_ACK arrives in nf_conntrack_proto_sctp.

The delayed INIT_ACK should be dropped in nf_conntrack_proto_sctp instead
of updating the vtag with the out-of-date init_tag. Otherwise, the vtag
in DATA chunks later sent by the client doesn't match the vtag in the
conntrack entry and the DATA chunks get dropped.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>

cf791b22

netfilter: handle the connecting collision properly in nf_conntrack_proto_sctp · 8e56b063

Xin Long authored Oct 03, 2023

In Scenario A and B below, as the delayed INIT_ACK always changes the peer
vtag, SCTP ct with the incorrect vtag may cause packet loss.

Scenario A: INIT_ACK is delayed until the peer receives its own INIT_ACK

  192.168.1.2 > 192.168.1.1: [INIT] [init tag: 1328086772]
    192.168.1.1 > 192.168.1.2: [INIT] [init tag: 1414468151]
    192.168.1.2 > 192.168.1.1: [INIT ACK] [init tag: 1328086772]
  192.168.1.1 > 192.168.1.2: [INIT ACK] [init tag: 1650211246] *
  192.168.1.2 > 192.168.1.1: [COOKIE ECHO]
    192.168.1.1 > 192.168.1.2: [COOKIE ECHO]
    192.168.1.2 > 192.168.1.1: [COOKIE ACK]

Scenario B: INIT_ACK is delayed until the peer completes its own handshake

  192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
    192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
    192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
    192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
    192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
  192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] *

This patch fixes it as below:

In SCTP_CID_INIT processing:
- clear ct->proto.sctp.init[!dir] if ct->proto.sctp.init[dir] &&
  ct->proto.sctp.init[!dir]. (Scenario E)
- set ct->proto.sctp.init[dir].

In SCTP_CID_INIT_ACK processing:
- drop it if !ct->proto.sctp.init[!dir] && ct->proto.sctp.vtag[!dir] &&
  ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario B, Scenario C)
- drop it if ct->proto.sctp.init[dir] && ct->proto.sctp.init[!dir] &&
  ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario A)

In SCTP_CID_COOKIE_ACK processing:
- clear ct->proto.sctp.init[dir] and ct->proto.sctp.init[!dir].
  (Scenario D)
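
The three rules listed above can be written down as a small user-space model. This is a sketch that transcribes only the conditions from the list: sctp_ct_model and the model_* functions are hypothetical names, and the rest of the conntrack state machine is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the per-connection SCTP conntrack state. */
struct sctp_ct_model {
	bool init[2];		/* INIT seen, per direction */
	uint32_t vtag[2];	/* verification tag, per direction */
};

/* SCTP_CID_INIT: forget the other side's INIT after an old collision. */
void model_on_init(struct sctp_ct_model *ct, int dir)
{
	assert(dir == 0 || dir == 1);
	if (ct->init[dir] && ct->init[!dir])
		ct->init[!dir] = false;
	ct->init[dir] = true;
}

/* SCTP_CID_INIT_ACK: return true when the chunk should be dropped. */
bool model_drop_init_ack(const struct sctp_ct_model *ct, int dir,
			 uint32_t init_tag)
{
	assert(dir == 0 || dir == 1);
	if (!ct->init[!dir] && ct->vtag[!dir] &&
	    ct->vtag[!dir] != init_tag)
		return true;
	if (ct->init[dir] && ct->init[!dir] &&
	    ct->vtag[!dir] != init_tag)
		return true;
	return false;
}

/* SCTP_CID_COOKIE_ACK: the handshake is done, clear both INIT marks. */
void model_on_cookie_ack(struct sctp_ct_model *ct)
{
	ct->init[0] = false;
	ct->init[1] = false;
}
```

In this model, an INIT_ACK whose init_tag differs from an already recorded vtag[!dir] is dropped once the INIT marks have been cleared, while an INIT_ACK carrying the matching tag is let through.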

Also, it's important to allow the ct state to move forward with cookie_echo
and cookie_ack from the opposite dir for the collision scenarios.

There are also other Scenarios where it should allow the packet through,
addressed by the processing above:

Scenario C: new CT is created by INIT_ACK.

Scenario D: start INIT on the existing ESTABLISHED ct.

Scenario E: start INIT after the old collision on the existing ESTABLISHED
ct.

  192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
  192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
  (both sides are stopped, then start a new connection again hours later)
  192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 242308742]

Fixes: 9fb9cbb1 ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>

8e56b063

netfilter: nft_payload: rebuild vlan header on h_proto access · af84f9e4

Florian Westphal authored Sep 29, 2023

nft can perform merging of adjacent payload requests.
This means that:

ether saddr 00:11 ... ether type 8021ad ...

is a single payload expression, for 8 bytes, starting at the
ethernet source offset.

Check that offset+length is fully within the source/destination mac
addresses.

This bug prevents 'ether type' from matching the correct h_proto in case
the vlan tag got stripped.
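
The offset+length check can be illustrated with plain arithmetic on the Ethernet header layout (destination MAC at bytes 0-5, source MAC at bytes 6-11, h_proto at bytes 12-13). The sketch below is a user-space model; MODEL_ETH_ALEN, MODEL_ADDRS_END, and model_needs_vlan_rebuild() are hypothetical names, not the kernel helper.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_ETH_ALEN	6			/* bytes per MAC address */
#define MODEL_ADDRS_END	(2 * MODEL_ETH_ALEN)	/* offset where h_proto starts */

/*
 * Return true when a payload request reaches past the two MAC
 * addresses, i.e. when it touches h_proto and the vlan header
 * would have to be rebuilt first.
 */
bool model_needs_vlan_rebuild(unsigned int offset, unsigned int len)
{
	unsigned int boundary = offset + len;

	assert(boundary >= offset);	/* no wrap-around */
	return boundary > MODEL_ADDRS_END;
}
```

For the merged expression from the example, the request starts at offset 6 and covers 8 bytes, so it ends at 14, which is past 12 and therefore touches h_proto. A request for the source address alone (offset 6, length 6) ends exactly at 12 and stays within the addresses.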

Fixes: de6843be ("netfilter: nft_payload: rebuild vlan header when needed")
Reported-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: Florian Westphal <fw@strlen.de>

af84f9e4

ibmveth: Remove condition to recompute TCP header checksum. · 51e7a666

David Wilder authored Sep 26, 2023

In some OVS environments the TCP pseudo header checksum may need to be
recomputed. Currently this is only done when the interface instance is
configured for "Trunk Mode". We found the issue also occurs in some
Kubernetes environments. These environments do not use "Trunk Mode",
therefore the condition is removed.

Performance tests with this change show only a fractional decrease in
throughput (< 0.2%).

Fixes: 7525de25 ("ibmveth: Set CHECKSUM_PARTIAL if NULL TCP CSUM.")
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Reviewed-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

51e7a666