Commits · 5b3b51a181fdf0387d931a6cedab30921147e576 · Kirill Smelkov / linux

26 Aug, 2022 7 commits

Merge branch 'prestera-matchall' · 5b3b51a1

David S. Miller authored Aug 26, 2022

Maksym Glubokiy says:

====================
net: prestera: matchall features

This patch series extracts matchall rules management out of SPAN API
implementation and adds 2 features on top of that:
  - support for egress traffic (mirred egress action)
  - proper rule priorities management between matchall and flower
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

5b3b51a1

net: prestera: manage matchall and flower priorities · 44af9571

Maksym Glubokiy authored Aug 23, 2022

matchall rules can be added only to chain 0 and their priorities have
limitations:
 - new matchall ingress rule's priority must be higher (lower value)
   than any existing flower rule;
 - new matchall egress rule's priority must be lower (higher value)
   than any existing flower rule.

The opposite works for flower rule adding:
 - new flower ingress rule's priority must be lower (higher value)
   than any existing matchall rule;
 - new flower egress rule's priority must be higher (lower value)
   than any existing matchall rule.

This is a hardware limitation and thus must be properly handled in
driver by reporting errors to the user when newly added rule has such a
priority that cannot be installed into the hardware.

To achieve this, the driver must maintain both min/max matchall
priorities for every flower block when user adds/deletes a matchall
rule, as well as both min/max flower priorities for chain 0 for every
adding/deletion of flower rules for chain 0.

Cc: Serhiy Boiko <serhiy.boiko@plvision.eu>
Signed-off-by: Maksym Glubokiy <maksym.glubokiy@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>

44af9571

net: prestera: add support for egress traffic mirroring · 8c448c2b

Serhiy Boiko authored Aug 23, 2022

This enables adding matchall rules for egress:

  tc filter add .. egress .. matchall skip_sw \
    action mirred egress mirror dev ..
Signed-off-by: Serhiy Boiko <serhiy.boiko@plvision.eu>
Signed-off-by: Maksym Glubokiy <maksym.glubokiy@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>

8c448c2b

net: prestera: acl: extract matchall logic into a separate file · 8afd552d

Serhiy Boiko authored Aug 23, 2022

This commit adds more clarity to handling of TC_CLSMATCHALL_REPLACE and
TC_CLSMATCHALL_DESTROY events by calling newly added *_mall_*() handlers
instead of directly calling SPAN API.

This also extracts matchall rules management out of SPAN API since SPAN
is a hardware module which is used to implement 'matchall egress mirred'
action only.
Signed-off-by: Taras Chornyi <tchornyi@marvell.com>
Signed-off-by: Serhiy Boiko <serhiy.boiko@plvision.eu>
Signed-off-by: Maksym Glubokiy <maksym.glubokiy@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>

8afd552d

net: asix: ax88772: add ethtool pause configuration · 6661918c

Oleksij Rempel authored Aug 22, 2022

Add phylink based ethtool pause configuration
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

6661918c

net: asix: ax88772: migrate to phylink · e0bffe3e

Oleksij Rempel authored Aug 22, 2022

There are some exotic ax88772 based devices which may require
functionality provide by the phylink framework. For example:
- US100A20SFP, USB 2.0 auf LWL Converter with SFP Cage
- AX88772B USB to 100Base-TX Ethernet (with RMII) demo board, where it
  is possible to switch between internal PHY and external RMII based
  connection.

So, convert this driver to phylink as soon as possible.

Tested with:
- AX88772A + internal PHY
- AX88772B + external DP83TD510E T1L PHY
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

e0bffe3e

dt-bindings: net: Add missing (unevaluated|additional)Properties on child nodes · 057062ad

Rob Herring authored Aug 25, 2022

In order to ensure only documented properties are present, node schemas
must have unevaluatedProperties or additionalProperties set to false
(typically). Add missing properties/$refs as exposed by this addition.
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220825192609.1538463-1-robh@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

057062ad

25 Aug, 2022 33 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 880b0dd9

Jakub Kicinski authored Aug 25, 2022

drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
21234e3a ("net/mlx5e: Fix use after free in mlx5e_fs_init()")
c7eafc5e ("net/mlx5e: Convert ethtool_steering member of flow_steering struct to pointer")
https://lore.kernel.org/all/20220825104410.67d4709c@canb.auug.org.au/
https://lore.kernel.org/all/20220823055533.334471-1-saeed@kernel.org/Signed-off-by: Jakub Kicinski <kuba@kernel.org>

880b0dd9

netdev: Use try_cmpxchg in napi_if_scheduled_mark_missed · b9030780

Uros Bizjak authored Aug 22, 2022

Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
napi_if_scheduled_mark_missed. x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg
(and related move instruction in front of cmpxchg).

Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails, enabling further code simplifications.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20220822143243.2798-1-ubizjak@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b9030780

Merge tag 'net-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 4c612826

Linus Torvalds authored Aug 25, 2022

Pull networking fixes from Jakub Kicinski:
 "Including fixes from ipsec and netfilter (with one broken Fixes tag).

  Current release - new code bugs:

   - dsa: don't dereference NULL extack in dsa_slave_changeupper()

   - dpaa: fix <1G ethernet on LS1046ARDB

   - neigh: don't call kfree_skb() under spin_lock_irqsave()

  Previous releases - regressions:

   - r8152: fix the RX FIFO settings when suspending

   - dsa: microchip: keep compatibility with device tree blobs with no
     phy-mode

   - Revert "net: macsec: update SCI upon MAC address change."

   - Revert "xfrm: update SA curlft.use_time", comply with RFC 2367

  Previous releases - always broken:

   - netfilter: conntrack: work around exceeded TCP receive window

   - ipsec: fix a null pointer dereference of dst->dev on a metadata dst
     in xfrm_lookup_with_ifid

   - moxa: get rid of asymmetry in DMA mapping/unmapping

   - dsa: microchip: make learning configurable and keep it off while
     standalone

   - ice: xsk: prohibit usage of non-balanced queue id

   - rxrpc: fix locking in rxrpc's sendmsg

  Misc:

   - another chunk of sysctl data race silencing"

* tag 'net-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
  net: lantiq_xrx200: restore buffer if memory allocation failed
  net: lantiq_xrx200: fix lock under memory pressure
  net: lantiq_xrx200: confirm skb is allocated before using
  net: stmmac: work around sporadic tx issue on link-up
  ionic: VF initial random MAC address if no assigned mac
  ionic: fix up issues with handling EAGAIN on FW cmds
  ionic: clear broken state on generation change
  rxrpc: Fix locking in rxrpc's sendmsg
  net: ethernet: mtk_eth_soc: fix hw hash reporting for MTK_NETSYS_V2
  MAINTAINERS: rectify file entry in BONDING DRIVER
  i40e: Fix incorrect address type for IPv6 flow rules
  ixgbe: stop resetting SYSTIME in ixgbe_ptp_start_cyclecounter
  net: Fix a data-race around sysctl_somaxconn.
  net: Fix a data-race around netdev_unregister_timeout_secs.
  net: Fix a data-race around gro_normal_batch.
  net: Fix data-races around sysctl_devconf_inherit_init_net.
  net: Fix data-races around sysctl_fb_tunnels_only_for_init_net.
  net: Fix a data-race around netdev_budget_usecs.
  net: Fix data-races around sysctl_max_skb_frags.
  net: Fix a data-race around netdev_budget.
  ...

4c612826

Merge branch 'mlxsw-remove-some-unused-code' · 2c58a914

Jakub Kicinski authored Aug 25, 2022

Petr Machata says:

====================
mlxsw: Remove some unused code

This patchset removes code that is not used anymore after the following two
commits removed all users:

- commit b0d80c01 ("mlxsw: Remove Mellanox SwitchX-2 ASIC support")
- commit 9b43fbb8 ("mlxsw: Remove Mellanox SwitchIB ASIC support")
====================

Link: https://lore.kernel.org/r/cover.1661350629.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2c58a914

mlxsw: Remove unused mlxsw_core_port_type_get() · 12be3edf

Jiri Pirko authored Aug 24, 2022

Function mlxsw_core_port_type_get() is no longer used. So remove it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

12be3edf

mlxsw: Remove unused port_type_set devlink op · 04a1b674

Jiri Pirko authored Aug 24, 2022

port_type_set devlink op is no longer used by any mlxsw driver,
so remove it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

04a1b674

mlxsw: Remove unused IB stuff · 3471ac9b

Jiri Pirko authored Aug 24, 2022

There are some IB leftovers that are no longer used in the code.
So remove them.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3471ac9b

Merge branch 'net-devlink-sync-flash-and-dev-info-commands' · bace3f46

Jakub Kicinski authored Aug 25, 2022

Jiri Pirko says:

====================
net: devlink: sync flash and dev info commands

Purpose of this patchset is to introduce consistency between two devlink
commands:
  devlink dev info
    Shows versions of running default flash target and components.
  devlink dev flash
    Flashes default flash target or component name (if specified
    on cmdline).

Currently it is up to the driver what versions to expose and what flash
update component names to accept. This is inconsistent. Thankfully, only
netdevsim currently using components so it is still time
to sanitize this.

This patchset makes sure, that devlink.c calls into driver for
component flash update only in case the driver exposes the same version
name.

Example:
$ devlink dev info
netdevsim/netdevsim10:
  driver netdevsim
  versions:
      running:
        fw.mgmt 10.20.30
      stored:
        fw.mgmt 10.20.30
$ devlink dev flash netdevsim/netdevsim10 file somefile.bin
[fw.mgmt] Preparing to flash
[fw.mgmt] Flashing 100%
[fw.mgmt] Flash select
[fw.mgmt] Flashing done
$ devlink dev flash netdevsim/netdevsim10 file somefile.bin component fw.mgmt
[fw.mgmt] Preparing to flash
[fw.mgmt] Flashing 100%
[fw.mgmt] Flash select
[fw.mgmt] Flashing done

$ devlink dev flash netdevsim/netdevsim10 file somefile.bin component dummy
Error: selected component is not supported by this device.
====================

Link: https://lore.kernel.org/r/20220824122011.1204330-1-jiri@resnulli.usSigned-off-by: Jakub Kicinski <kuba@kernel.org>

bace3f46

net: devlink: limit flash component name to match version returned by info_get() · f94b6063

Jiri Pirko authored Aug 24, 2022

Limit the acceptance of component name passed to cmd_flash_update() to
match one of the versions returned by info_get(), marked by version type.
This makes things clearer and enforces 1:1 mapping between exposed
version and accepted flash component.

Check VERSION_TYPE_COMPONENT version type during cmd_flash_update()
execution by calling info_get() with different "req" context.
That causes info_get() to lookup the component name instead of
filling-up the netlink message.

Remove "UPDATE_COMPONENT" flag which becomes used.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f94b6063

netdevsim: add version fw.mgmt info info_get() and mark as a component · 0c198975

Jiri Pirko authored Aug 24, 2022

Fix the only component user which is netdevsim. It uses component named
"fw.mgmt" in selftests. So add this version to info_get() output with
version type component.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0c198975

net: devlink: extend info_get() version put to indicate a flash component · bb670123

Jiri Pirko authored Aug 24, 2022

Whenever the driver is called by his info_get() op, it may put multiple
version names and values to the netlink message. Extend by additional
helper devlink_info_version_running/stored_put_ext() that allows to
specify a version type that indicates when particular version name
represents a flash component.

This is going to be used in follow-up patch calling info_get() during
flash update command checking if version with this the version type
exists.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bb670123

selftests/net: fix reinitialization of TEST_PROGS in net self tests. · 88e500af

Adel Abouchaev authored Aug 24, 2022

Assinging will drop all previous tests.

Fixes: b690842d ("selftests/net: test l2 tunnel TOS/TTL inheriting")
Signed-off-by: Adel Abouchaev <adel.abushaev@gmail.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Link: https://lore.kernel.org/r/20220824184351.3759862-1-adel.abushaev@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

88e500af

Merge branch 'net-lantiq_xrx200-fix-errors-under-memory-pressure' · d974730c

Jakub Kicinski authored Aug 25, 2022

Aleksander Jan Bajkowski says:

====================
net: lantiq_xrx200: fix errors under memory pressure

This series fixes issues that can occur in the driver under memory pressure.
Situations when the system cannot allocate memory are rare, so the mentioned
bugs have been fixed recently. The patches have been tested on a BT Home
router with the Lantiq xRX200 chipset.

Changelog:
  v3: - removed netdev_err() log from the first patch
  v2:
   - the second patch has been changed, so that under memory pressure situation
     the driver will not receive packets indefinitely regardless of the NAPI budget,
   - the third patch has been added.
====================

Link: https://lore.kernel.org/r/20220824215408.4695-1-olek2@wp.plSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d974730c

net: lantiq_xrx200: restore buffer if memory allocation failed · c9c3b177

Aleksander Jan Bajkowski authored Aug 24, 2022

In a situation where memory allocation fails, an invalid buffer address
is stored. When this descriptor is used again, the system panics in the
build_skb() function when accessing memory.

Fixes: 7ea6cd16 ("lantiq: net: fix duplicated skb in rx descriptor ring")
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c9c3b177

net: lantiq_xrx200: fix lock under memory pressure · c4b6e934

Aleksander Jan Bajkowski authored Aug 24, 2022

When the xrx200_hw_receive() function returns -ENOMEM, the NAPI poll
function immediately returns an error.
This is incorrect for two reasons:
* the function terminates without enabling interrupts or scheduling NAPI,
* the error code (-ENOMEM) is returned instead of the number of received
packets.

After the first memory allocation failure occurs, packet reception is
locked due to disabled interrupts from DMA..

Fixes: fe1a5642 ("net: lantiq: Add Lantiq / Intel VRX200 Ethernet driver")
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c4b6e934

net: lantiq_xrx200: confirm skb is allocated before using · c8b04370

Aleksander Jan Bajkowski authored Aug 24, 2022

xrx200_hw_receive() assumes build_skb() always works and goes straight
to skb_reserve(). However, build_skb() can fail under memory pressure.

Add a check in case build_skb() failed to allocate and return NULL.

Fixes: e0155935 ("net: lantiq_xrx200: convert to build_skb")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c8b04370

net: stmmac: work around sporadic tx issue on link-up · a3a57bf0

Heiner Kallweit authored Aug 24, 2022

This is a follow-up to the discussion in [0]. It seems to me that
at least the IP version used on Amlogic SoC's sometimes has a problem
if register MAC_CTRL_REG is written whilst the chip is still processing
a previous write. But that's just a guess.
Adding a delay between two writes to this register helps, but we can
also simply omit the offending second write. This patch uses the second
approach and is based on a suggestion from Qi Duan.
Benefit of this approach is that we can save few register writes, also
on not affected chip versions.

[0] https://www.spinics.net/lists/netdev/msg831526.html

Fixes: bfab27a1 ("stmmac: add the experimental PCI support")
Suggested-by: Qi Duan <qi.duan@amlogic.com>
Suggested-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/e99857ce-bd90-5093-ca8c-8cd480b5a0a2@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a3a57bf0

Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · ef332fe1

Jakub Kicinski authored Aug 25, 2022

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2022-08-24 (ixgbe, i40e)

This series contains updates to ixgbe and i40e drivers.

Jake stops incorrect resetting of SYSTIME registers when starting
cyclecounter for ixgbe.

Sylwester corrects a check on source IP address when validating destination
for i40e.

* '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  i40e: Fix incorrect address type for IPv6 flow rules
  ixgbe: stop resetting SYSTIME in ixgbe_ptp_start_cyclecounter
====================

Link: https://lore.kernel.org/r/20220824193748.874343-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ef332fe1

Merge branch 'ionic-bug-fixes' · 92df825a

Jakub Kicinski authored Aug 25, 2022

Shannon Nelson says:

====================
ionic: bug fixes

These are a couple of maintenance bug fixes for the Pensando ionic
networking driver.

Mohamed takes care of a "plays well with others" issue where the
VF spec is a bit vague on VF mac addresses, but certain customers
have come to expect behavior based on other vendor drivers.

Shannon addresses a couple of corner cases seen in internal
stress testing.
====================

Link: https://lore.kernel.org/r/20220824165051.6185-1-snelson@pensando.ioSigned-off-by: Jakub Kicinski <kuba@kernel.org>

92df825a

ionic: VF initial random MAC address if no assigned mac · 19058be7

R Mohamed Shah authored Aug 24, 2022

Assign a random mac address to the VF interface station
address if it boots with a zero mac address in order to match
similar behavior seen in other VF drivers.  Handle the errors
where the older firmware does not allow the VF to set its own
station address.

Newer firmware will allow the VF to set the station mac address
if it hasn't already been set administratively through the PF.
Setting it will also be allowed if the VF has trust.

Fixes: fbb39807 ("ionic: support sr-iov operations")
Signed-off-by: R Mohamed Shah <mohamed@pensando.io>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

19058be7

ionic: fix up issues with handling EAGAIN on FW cmds · 0fc4dd45

Shannon Nelson authored Aug 24, 2022

In looping on FW update tests we occasionally see the
FW_ACTIVATE_STATUS command fail while it is in its EAGAIN loop
waiting for the FW activate step to finsh inside the FW.  The
firmware is complaining that the done bit is set when a new
dev_cmd is going to be processed.

Doing a clean on the cmd registers and doorbell before exiting
the wait-for-done and cleaning the done bit before the sleep
prevents this from occurring.

Fixes: fbfb8031 ("ionic: Add hardware init and device commands")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0fc4dd45

ionic: clear broken state on generation change · 9cb9dadb

Shannon Nelson authored Aug 24, 2022

There is a case found in heavy testing where a link flap happens just
before a firmware Recovery event and the driver gets stuck in the
BROKEN state.  This comes from the driver getting interrupted by a FW
generation change when coming back up from the link flap, and the call
to ionic_start_queues() in ionic_link_status_check() fails.  This can be
addressed by having the fw_up code clear the BROKEN bit if seen, rather
than waiting for a user to manually force the interface down and then
back up.

Fixes: 9e8eaf84 ("ionic: stop watchdog when in broken state")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9cb9dadb

rxrpc: Fix locking in rxrpc's sendmsg · b0f571ec

David Howells authored Aug 24, 2022

Fix three bugs in the rxrpc's sendmsg implementation:

 (1) rxrpc_new_client_call() should release the socket lock when returning
     an error from rxrpc_get_call_slot().

 (2) rxrpc_wait_for_tx_window_intr() will return without the call mutex
     held in the event that we're interrupted by a signal whilst waiting
     for tx space on the socket or relocking the call mutex afterwards.

     Fix this by: (a) moving the unlock/lock of the call mutex up to
     rxrpc_send_data() such that the lock is not held around all of
     rxrpc_wait_for_tx_window*() and (b) indicating to higher callers
     whether we're return with the lock dropped.  Note that this means
     recvmsg() will not block on this call whilst we're waiting.

 (3) After dropping and regaining the call mutex, rxrpc_send_data() needs
     to go and recheck the state of the tx_pending buffer and the
     tx_total_len check in case we raced with another sendmsg() on the same
     call.

Thinking on this some more, it might make sense to have different locks for
sendmsg() and recvmsg().  There's probably no need to make recvmsg() wait
for sendmsg().  It does mean that recvmsg() can return MSG_EOR indicating
that a call is dead before a sendmsg() to that call returns - but that can
currently happen anyway.

Without fix (2), something like the following can be induced:

	WARNING: bad unlock balance detected!
	5.16.0-rc6-syzkaller #0 Not tainted
	-------------------------------------
	syz-executor011/3597 is trying to release lock (&call->user_mutex) at:
	[<ffffffff885163a3>] rxrpc_do_sendmsg+0xc13/0x1350 net/rxrpc/sendmsg.c:748
	but there are no more locks to release!

	other info that might help us debug this:
	no locks held by syz-executor011/3597.
	...
	Call Trace:
	 <TASK>
	 __dump_stack lib/dump_stack.c:88 [inline]
	 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
	 print_unlock_imbalance_bug include/trace/events/lock.h:58 [inline]
	 __lock_release kernel/locking/lockdep.c:5306 [inline]
	 lock_release.cold+0x49/0x4e kernel/locking/lockdep.c:5657
	 __mutex_unlock_slowpath+0x99/0x5e0 kernel/locking/mutex.c:900
	 rxrpc_do_sendmsg+0xc13/0x1350 net/rxrpc/sendmsg.c:748
	 rxrpc_sendmsg+0x420/0x630 net/rxrpc/af_rxrpc.c:561
	 sock_sendmsg_nosec net/socket.c:704 [inline]
	 sock_sendmsg+0xcf/0x120 net/socket.c:724
	 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2409
	 ___sys_sendmsg+0xf3/0x170 net/socket.c:2463
	 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2492
	 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
	 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
	 entry_SYSCALL_64_after_hwframe+0x44/0xae

[Thanks to Hawkins Jiawei and Khalid Masum for their attempts to fix this]

Fixes: bc5e3a54 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Reported-by: syzbot+7f0483225d0c94cb3441@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: syzbot+7f0483225d0c94cb3441@syzkaller.appspotmail.com
cc: Hawkins Jiawei <yin31149@gmail.com>
cc: Khalid Masum <khalid.masum.92@gmail.com>
cc: Dan Carpenter <dan.carpenter@oracle.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/166135894583.600315.7170979436768124075.stgit@warthog.procyon.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b0f571ec

Merge tag 'cgroup-for-6.0-rc2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 3f5c2005

Linus Torvalds authored Aug 25, 2022

Pull another cgroup fix from Tejun Heo:
 "Commit 4f7e7236 ("cgroup: Fix threadgroup_rwsem <->
  cpus_read_lock() deadlock") required the cgroup
  core to grab cpus_read_lock() before invoking ->attach().

  Unfortunately, it missed adding cpus_read_lock() in
  cgroup_attach_task_all(). Fix it"

* tag 'cgroup-for-6.0-rc2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()

3f5c2005

cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all() · 43626dad

Tetsuo Handa authored Aug 25, 2022

syzbot is hitting percpu_rwsem_assert_held(&cpu_hotplug_lock) warning at
cpuset_attach() [1], for commit 4f7e7236 ("cgroup: Fix
threadgroup_rwsem <-> cpus_read_lock() deadlock") missed that
cpuset_attach() is also called from cgroup_attach_task_all().
Add cpus_read_lock() like what cgroup_procs_write_start() does.

Link: https://syzkaller.appspot.com/bug?extid=29d3a3b4d86c8136ad9e [1]
Reported-by: syzbot <syzbot+29d3a3b4d86c8136ad9e@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 4f7e7236 ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock")
Signed-off-by: Tejun Heo <tj@kernel.org>

43626dad

net: sched: delete duplicate cleanup of backlog and qlen · c19d893f

Zhengchao Shao authored Aug 24, 2022

qdisc_reset() is clearing qdisc->q.qlen and qdisc->qstats.backlog
_after_ calling qdisc->ops->reset. There is no need to clear them
again in the specific reset function.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20220824005231.345727-1-shaozhengchao@huawei.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

c19d893f

net: ethernet: mtk_eth_soc: fix hw hash reporting for MTK_NETSYS_V2 · 0cf731f9

Lorenzo Bianconi authored Aug 23, 2022

Properly report hw rx hash for mt7986 chipset accroding to the new dma
descriptor layout.

Fixes: 197c9e9b ("net: ethernet: mtk_eth_soc: introduce support for mt7986 chipset")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/091394ea4e705fbb35f828011d98d0ba33808f69.1661257293.git.lorenzo@kernel.orgSigned-off-by: Paolo Abeni <pabeni@redhat.com>

0cf731f9

nfp: flower: support case of match on ct_state(0/0x3f) · ff763011

Wenjuan Geng authored Aug 23, 2022

is_post_ct_flow() function will process only ct_state ESTABLISHED,
then offload_pre_check() function will check FLOW_DISSECTOR_KEY_CT flag.
When config tc filter match ct_state(0/0x3f), dissector->used_keys
with FLOW_DISSECTOR_KEY_CT bit, function offload_pre_check() will
return false, so not offload. This is a special case that can be handled
safely.

Therefore, modify to let initial packet which won't go through conntrack
can be offloaded, as long as the cared ct fields are all zero.
Signed-off-by: Wenjuan Geng <wenjuan.geng@corigine.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20220823090122.403631-1-simon.horman@corigine.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

ff763011

net: gro: skb_gro_header helper function · 35ffb665

Richard Gobert authored Aug 23, 2022

Introduce a simple helper function to replace a common pattern.
When accessing the GRO header, we fetch the pointer from frag0,
then test its validity and fetch it from the skb when necessary.

This leads to the pattern
skb_gro_header_fast -> skb_gro_header_hard -> skb_gro_header_slow
recurring many times throughout GRO code.

This patch replaces these patterns with a single inlined function
call, improving code readability.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220823071034.GA56142@debianSigned-off-by: Paolo Abeni <pabeni@redhat.com>

35ffb665

Documentation: devlink: fix the locking section · 77a70f9c

Jiri Pirko authored Aug 23, 2022

As all callbacks are converted now, fix the text reflecting that change.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20220823070213.1008956-1-jiri@resnulli.usSigned-off-by: Jakub Kicinski <kuba@kernel.org>

77a70f9c

Merge branch 'add-a-second-bind-table-hashed-by-port-and-address' · c21e1bf4

Jakub Kicinski authored Aug 24, 2022

Joanne Koong says:

====================
Add a second bind table hashed by port and address

Currently, there is one bind hashtable (bhash) that hashes by port only.
This patchset adds a second bind table (bhash2) that hashes by port and
address.

The motivation for adding bhash2 is to expedite bind requests in situations
where the port has many sockets in its bhash table entry (eg a large number
of sockets bound to different addresses on the same port), which makes checking
bind conflicts costly especially given that we acquire the table entry spinlock
while doing so, which can cause softirq cpu lockups and can prevent new tcp
connections.

We ran into this problem at Meta where the traffic team binds a large number
of IPs to port 443 and the bind() call took a significant amount of time
which led to cpu softirq lockups, which caused packet drops and other failures
on the machine.

When experimentally testing this on a local server for ~24k sockets bound to
the port, the results seen were:

ipv4:
before - 0.002317 seconds
with bhash2 - 0.000020 seconds

ipv6:
before - 0.002431 seconds
with bhash2 - 0.000021 seconds

The additions to the initial bhash2 submission [0] are:
* Updating bhash2 in the cases where a socket's rcv saddr changes after it has
* been bound
* Adding locks for bhash2 hashbuckets

[0] https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/

v3: https://lore.kernel.org/netdev/20220722195406.1304948-2-joannelkoong@gmail.com/
v2: https://lore.kernel.org/netdev/20220712235310.1935121-1-joannelkoong@gmail.com/
v1: https://lore.kernel.org/netdev/20220623234242.2083895-2-joannelkoong@gmail.com/
====================

Link: https://lore.kernel.org/r/20220822181023.3979645-1-joannelkoong@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c21e1bf4

selftests/net: Add sk_bind_sendto_listen and sk_connect_zero_addr · 1be9ac87

Joanne Koong authored Aug 22, 2022

This patch adds 2 new tests: sk_bind_sendto_listen and
sk_connect_zero_addr.

The sk_bind_sendto_listen test exercises the path where a socket's
rcv saddr changes after it has been added to the binding tables,
and then a listen() on the socket is invoked. The listen() should
succeed.

The sk_bind_sendto_listen test is copied over from one of syzbot's
tests: https://syzkaller.appspot.com/x/repro.c?x=1673a38df00000

The sk_connect_zero_addr test exercises the path where the socket was
never previously added to the binding tables and it gets assigned a
saddr upon a connect() to address 0.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1be9ac87

selftests/net: Add test for timing a bind request to a port with a populated bhash entry · c35ecb95

Joanne Koong authored Aug 22, 2022

This test populates the bhash table for a given port with
MAX_THREADS * MAX_CONNECTIONS sockets, and then times how long
a bind request on the port takes.

When populating the bhash table, we create the sockets and then bind
the sockets to the same address and port (SO_REUSEADDR and SO_REUSEPORT
are set). When timing how long a bind on the port takes, we bind on a
different address without SO_REUSEPORT set. We do not set SO_REUSEPORT
because we are interested in the case where the bind request does not
go through the tb->fastreuseport path, which is fragile (eg
tb->fastreuseport path does not work if binding with a different uid).

To run the script:
    Usage: ./bind_bhash.sh [-6 | -4] [-p port] [-a address]
	    6: use ipv6
	    4: use ipv4
	    port: Port number
	    address: ip address

Without any arguments, ./bind_bhash.sh defaults to ipv6 using ip address
"2001:0db8:0:f101::1" on port 443.

On my local machine, I see:
ipv4:
before - 0.002317 seconds
with bhash2 - 0.000020 seconds

ipv6:
before - 0.002431 seconds
with bhash2 - 0.000021 seconds
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c35ecb95