- 18 Oct, 2023 28 commits
-
-
Florian Westphal authored
Same as previous change: we need to mask out the non-verdict bits, as upcoming patches may embed an errno value in NF_STOLEN verdicts too. NF_DROP could already do this, but not all called functions do this. Checks that only test ret vs NF_ACCEPT are fine, the 'errno parts' are always 0 for those. Signed-off-by: Florian Westphal <fw@strlen.de>
-
Florian Westphal authored
This function calls helpers that can return nf-verdicts, but then those get converted to -1/0 as that's what the caller expects. Theoretically NF_DROP could have an errno number set in the upper 24 bits of the return value. Or any of those helpers could return NF_STOLEN, which would result in a use-after-free. This is fine as-is, the called functions don't do this yet. But it's better to avoid possible future problems if the upcoming patchset to add NF_DROP_REASON() support gains further users, so remove the 0/-1 translation from the picture and pass the verdicts down to the caller. Signed-off-by: Florian Westphal <fw@strlen.de>
-
Florian Westphal authored
nftables trace infra must mask out the non-verdict bit parts of the return value, else followup changes that 'return errno << 8 | NF_STOLEN' will cause breakage. Signed-off-by: Florian Westphal <fw@strlen.de>
-
Florian Westphal authored
These checks assume that the caller only returns NF_DROP without any errno embedded in the upper bits. This is fine right now, but followup patches will start to propagate such errors to allow kfree_skb_drop_reason() in the called functions, those would then indicate 'errno << 8 | NF_STOLEN'. To not break things we have to mask those parts out. Signed-off-by: Florian Westphal <fw@strlen.de>
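A minimal sketch of the kind of check these patches adjust, assuming the standard NF_VERDICT_MASK from the netfilter uapi header; the helper name here is invented for illustration, not code from the series:

    #include <linux/netfilter.h>

    /* Once callees may return "errno << 8 | NF_STOLEN" (or NF_DROP with an
     * embedded errno), comparing the raw return value against NF_DROP is
     * wrong; keep only the verdict bits before comparing. */
    static bool example_verdict_is_drop(unsigned int verdict)
    {
        return (verdict & NF_VERDICT_MASK) == NF_DROP;
    }

Checks against NF_ACCEPT can remain plain equality tests, since the errno bits are always zero for that verdict.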
-
David S. Miller authored
Jiri Pirko says: ==================== devlink: fix a deadlock when taking devlink instance lock while holding RTNL lock devlink_port_fill() may sometimes be called with the RTNL lock held. When putting the nested port function devlink instance attrs, the current code takes the nested devlink instance lock. In that case the lock ordering is wrong. Patch #1 is a dependency of patch #2. Patch #2 converts the peernet2id_alloc() call to rely on RCU so it can be called without the devlink instance lock. Patch #3 takes a device reference for the devlink instance, making sure that the device does not disappear before devlink_release() is called. Patch #4 benefits from the preparations done in patches #2 and #3 and removes the problematic nested devlink lock acquisition. Patches #5-#7 improve documentation to reflect this issue so it is avoided in the future. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Add documentation for devlink_rel_nested_in_notify() describing the devlink instance locking consequences. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Add a note describing the locking order of taking RTNL lock with devlink instance lock. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Add a part talking about nested devlink instances describing the helpers and locking ordering. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Lockdep reports following issue:

 WARNING: possible circular locking dependency detected
 ------------------------------------------------------
 devlink/8191 is trying to acquire lock:
 ffff88813f32c250 (&devlink->lock_key#14){+.+.}-{3:3}, at: devlink_rel_devlink_handle_put+0x11e/0x2d0

 but task is already holding lock:
 ffffffff8511eca8 (rtnl_mutex){+.+.}-{3:3}, at: unregister_netdev+0xe/0x20

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #3 (rtnl_mutex){+.+.}-{3:3}:
        lock_acquire+0x1c3/0x500
        __mutex_lock+0x14c/0x1b20
        register_netdevice_notifier_net+0x13/0x30
        mlx5_lag_add_mdev+0x51c/0xa00 [mlx5_core]
        mlx5_load+0x222/0xc70 [mlx5_core]
        mlx5_init_one_devl_locked+0x4a0/0x1310 [mlx5_core]
        mlx5_init_one+0x3b/0x60 [mlx5_core]
        probe_one+0x786/0xd00 [mlx5_core]
        local_pci_probe+0xd7/0x180
        pci_device_probe+0x231/0x720
        really_probe+0x1e4/0xb60
        __driver_probe_device+0x261/0x470
        driver_probe_device+0x49/0x130
        __driver_attach+0x215/0x4c0
        bus_for_each_dev+0xf0/0x170
        bus_add_driver+0x21d/0x590
        driver_register+0x133/0x460
        vdpa_match_remove+0x89/0xc0 [vdpa]
        do_one_initcall+0xc4/0x360
        do_init_module+0x22d/0x760
        load_module+0x51d7/0x6750
        init_module_from_file+0xd2/0x130
        idempotent_init_module+0x326/0x5a0
        __x64_sys_finit_module+0xc1/0x130
        do_syscall_64+0x3d/0x90
        entry_SYSCALL_64_after_hwframe+0x46/0xb0

 -> #2 (mlx5_intf_mutex){+.+.}-{3:3}:
        lock_acquire+0x1c3/0x500
        __mutex_lock+0x14c/0x1b20
        mlx5_register_device+0x3e/0xd0 [mlx5_core]
        mlx5_init_one_devl_locked+0x8fa/0x1310 [mlx5_core]
        mlx5_devlink_reload_up+0x147/0x170 [mlx5_core]
        devlink_reload+0x203/0x380
        devlink_nl_cmd_reload+0xb84/0x10e0
        genl_family_rcv_msg_doit+0x1cc/0x2a0
        genl_rcv_msg+0x3c9/0x670
        netlink_rcv_skb+0x12c/0x360
        genl_rcv+0x24/0x40
        netlink_unicast+0x435/0x6f0
        netlink_sendmsg+0x7a0/0xc70
        sock_sendmsg+0xc5/0x190
        __sys_sendto+0x1c8/0x290
        __x64_sys_sendto+0xdc/0x1b0
        do_syscall_64+0x3d/0x90
        entry_SYSCALL_64_after_hwframe+0x46/0xb0

 -> #1 (&dev->lock_key#8){+.+.}-{3:3}:
        lock_acquire+0x1c3/0x500
        __mutex_lock+0x14c/0x1b20
        mlx5_init_one_devl_locked+0x45/0x1310 [mlx5_core]
        mlx5_devlink_reload_up+0x147/0x170 [mlx5_core]
        devlink_reload+0x203/0x380
        devlink_nl_cmd_reload+0xb84/0x10e0
        genl_family_rcv_msg_doit+0x1cc/0x2a0
        genl_rcv_msg+0x3c9/0x670
        netlink_rcv_skb+0x12c/0x360
        genl_rcv+0x24/0x40
        netlink_unicast+0x435/0x6f0
        netlink_sendmsg+0x7a0/0xc70
        sock_sendmsg+0xc5/0x190
        __sys_sendto+0x1c8/0x290
        __x64_sys_sendto+0xdc/0x1b0
        do_syscall_64+0x3d/0x90
        entry_SYSCALL_64_after_hwframe+0x46/0xb0

 -> #0 (&devlink->lock_key#14){+.+.}-{3:3}:
        check_prev_add+0x1af/0x2300
        __lock_acquire+0x31d7/0x4eb0
        lock_acquire+0x1c3/0x500
        __mutex_lock+0x14c/0x1b20
        devlink_rel_devlink_handle_put+0x11e/0x2d0
        devlink_nl_port_fill+0xddf/0x1b00
        devlink_port_notify+0xb5/0x220
        __devlink_port_type_set+0x151/0x510
        devlink_port_netdevice_event+0x17c/0x220
        notifier_call_chain+0x97/0x240
        unregister_netdevice_many_notify+0x876/0x1790
        unregister_netdevice_queue+0x274/0x350
        unregister_netdev+0x18/0x20
        mlx5e_vport_rep_unload+0xc5/0x1c0 [mlx5_core]
        __esw_offloads_unload_rep+0xd8/0x130 [mlx5_core]
        mlx5_esw_offloads_rep_unload+0x52/0x70 [mlx5_core]
        mlx5_esw_offloads_unload_rep+0x85/0xc0 [mlx5_core]
        mlx5_eswitch_unload_sf_vport+0x41/0x90 [mlx5_core]
        mlx5_devlink_sf_port_del+0x120/0x280 [mlx5_core]
        genl_family_rcv_msg_doit+0x1cc/0x2a0
        genl_rcv_msg+0x3c9/0x670
        netlink_rcv_skb+0x12c/0x360
        genl_rcv+0x24/0x40
        netlink_unicast+0x435/0x6f0
        netlink_sendmsg+0x7a0/0xc70
        sock_sendmsg+0xc5/0x190
        __sys_sendto+0x1c8/0x290
        __x64_sys_sendto+0xdc/0x1b0
        do_syscall_64+0x3d/0x90
        entry_SYSCALL_64_after_hwframe+0x46/0xb0

 other info that might help us debug this:

 Chain exists of:
   &devlink->lock_key#14 --> mlx5_intf_mutex --> rtnl_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(rtnl_mutex);
                               lock(mlx5_intf_mutex);
                               lock(rtnl_mutex);
  lock(&devlink->lock_key#14);

Problem is taking the devlink instance lock of nested instance when RTNL is already held. To fix this, don't take the devlink instance lock when putting nested handle. Instead, rely on the preparations done by previous two patches to be able to access device pointer and obtain netns id without devlink instance lock held. Fixes: c137743b ("devlink: introduce object and nested devlink relationship infra") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
In preparation for allowing access to the device pointer without the devlink instance lock held, make sure the device pointer is usable until devlink_release() is called. Fixes: c137743b ("devlink: introduce object and nested devlink relationship infra") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
peernet2id_alloc() may be called locklessly with a peer net pointer obtained in an RCU critical section, and makes sure to return an ns ID only if the net namespace is not being removed concurrently. Benefit from the read_pnet_rcu() helper addition: use it to obtain the net pointer under the RCU read lock and pass it to peernet2id_alloc() to get the ns ID. Fixes: c137743b ("devlink: introduce object and nested devlink relationship infra") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
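A rough sketch of the lockless pattern this patch switches to; read_pnet_rcu() is the helper introduced in the preceding patch, while the function and parameter names below are invented for illustration only:

    #include <linux/gfp.h>
    #include <linux/rcupdate.h>
    #include <net/net_namespace.h>

    /* Obtain the peer netns id without holding the (nested) devlink
     * instance lock, relying on RCU to keep the net pointer valid. */
    static int example_peer_ns_id(possible_net_t *pnet, struct net *req_net)
    {
        struct net *peer_net;
        int ns_id;

        rcu_read_lock();
        peer_net = read_pnet_rcu(pnet);
        ns_id = peernet2id_alloc(req_net, peer_net, GFP_ATOMIC);
        rcu_read_unlock();

        return ns_id;
    }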
-
Jiri Pirko authored
Annotate the net pointer stored in the possible_net_t structure as an RCU pointer and change the access helpers to treat it as such. Introduce a read_pnet_rcu() helper to allow callers to dereference the net pointer under the RCU read lock. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Jakub Kicinski authored
Saeed Mahameed says: ==================== mlx5-updates-2023-10-10

1) Adham Faris, Increase max supported channels number to 256
2) Leon Romanovsky, Allow IPsec soft/hard limits in bytes
3) Shay Drory, Replace global mlx5_intf_lock with HCA devcom component lock
4) Wei Zhang, Optimize SF creation flow

During SF creation, HCA state gets changed from INVALID to IN_USE step by step. Accordingly, FW sends vhca event to driver to inform about this state change asynchronously. Each vhca event is critical because all related SW/FW operations are triggered by it. Currently there is only a single mlx5 general event handler which not only handles vhca event but many other events. This incurs huge bottleneck because all events are forced to be handled in serial manner. Moreover, all SFs share same table_lock which inevitably impacts each other when they are created in parallel.

This series will solve this issue by:
1. A dedicated vhca event handler is introduced to eliminate the mutual impact with other mlx5 events.
2. Max FW threads work queues are employed in the vhca event handler to fully utilize FW capability.
3. Redesign SF active work logic to completely remove table_lock.

With above optimization, SF creation time is reduced by 25%, i.e. from 80s to 60s when creating 100 SFs.

Patches summary:
Patch 1 - implement dedicated vhca event handler with max FW cmd threads of work queues.
Patch 2 - remove table_lock by redesigning SF active work logic.

* tag 'mlx5-updates-2023-10-10' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5e: Allow IPsec soft/hard limits in bytes
  net/mlx5e: Increase max supported channels number to 256
  net/mlx5e: Preparations for supporting larger number of channels
  net/mlx5e: Refactor mlx5e_rss_init() and mlx5e_rss_free() API's
  net/mlx5e: Refactor mlx5e_rss_set_rxfh() and mlx5e_rss_get_rxfh()
  net/mlx5e: Refactor rx_res_init() and rx_res_free() APIs
  net/mlx5e: Use PTR_ERR_OR_ZERO() to simplify code
  net/mlx5: Use PTR_ERR_OR_ZERO() to simplify code
  net/mlx5: fix config name in Kconfig parameter documentation
  net/mlx5: Remove unused declaration
  net/mlx5: Replace global mlx5_intf_lock with HCA devcom component lock
  net/mlx5: Refactor LAG peer device lookout bus logic to mlx5 devcom
  net/mlx5: Avoid false positive lockdep warning by adding lock_class_key
  net/mlx5: Redesign SF active work to remove table_lock
  net/mlx5: Parallelize vhca event handling
====================

Link: https://lore.kernel.org/r/20231014171908.290428-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Justin Stitt authored
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. We expect both hi.data.modename and hi.data.drivername to be NUL-terminated based on their usage with sprintf:
| sprintf(hi.data.modename, "%sclk,%smodem,fclk=%d,bps=%d%s",
|         bc->cfg.intclk ? "int" : "ext",
|         bc->cfg.extmodem ? "ext" : "int", bc->cfg.fclk, bc->cfg.bps,
|         bc->cfg.loopback ? ",loopback" : "");
Note that this data is copied out to userspace with:
| if (copy_to_user(data, &hi, sizeof(hi)))
... however, the data was also copied FROM the user here:
| if (copy_from_user(&hi, data, sizeof(hi)))
Considering the above, a suitable replacement is strscpy_pad() as it guarantees NUL-termination on the destination buffer while also NUL-padding (which is good+wanted behavior when copying data to userspace).
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231016-strncpy-drivers-net-hamradio-baycom_epp-c-v2-1-39f72a72de30@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
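A short sketch of the replacement pattern under discussion, using an invented struct with illustrative buffer sizes rather than the driver's real one:

    #include <linux/string.h>

    struct example_info {
        char modename[32];     /* sizes are illustrative */
        char drivername[32];
    };

    static void example_fill(struct example_info *hi, const char *mode,
                             const char *drv)
    {
        /* strscpy_pad() always NUL-terminates and zero-fills the rest of
         * the buffer, which matters when the struct is later copied to
         * userspace: no stale stack/heap bytes leak out. */
        strscpy_pad(hi->modename, mode, sizeof(hi->modename));
        strscpy_pad(hi->drivername, drv, sizeof(hi->drivername));
    }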
-
Jakub Kicinski authored
Jiri moved version to legacy specs in commit 0f07415e ("netlink: specs: don't allow version to be specified for genetlink"). Update the documentation. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231016214540.1822392-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
I recently cleaned up specs to not specify enum-as-flags when target enum is already defined as flags. YNL Python library did not convert flags, unfortunately, so this caused breakage for Stan and Willem. Note that the nlspec.py abstraction already hides the differences between flags and enums (value vs user_value), so the changes are pretty trivial. Fixes: 0629f22e ("ynl: netdev: drop unnecessary enum-as-flags") Reported-and-tested-by: Willem de Bruijn <willemb@google.com> Reported-and-tested-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/all/ZS10NtQgd_BJZ3RU@google.com/ Link: https://lore.kernel.org/r/20231016213937.1820386-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Russell King says: ==================== net: remove last of the phylink validate methods and clean up This four patch series removes the last of the phylink MAC .validate methods which can be found in the Freescale fman driver. fman has a requirement that half duplex may not be supported in RGMII mode, which is currently handled in its .validate method. In order to keep this functionality when removing the .validate method, we need to replace that with equivalent functionality, for which I propose the optional .mac_get_caps method in the first patch. The advantage of this approach over the .validate callback is that MAC drivers only have to deal with the MAC_* capabilities, and don't need to call back into phylink functions to do the masking of the ethtool linkmodes etc - which then becomes internal to phylink. This can be seen in the fourth patch where we make a load of these methods static. ==================== Link: https://lore.kernel.org/r/ZS1Z5DDfHyjMryYu@shell.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Russell King (Oracle) authored
Remove exports for phylink_caps_to_linkmodes(), phylink_get_capabilities(), phylink_validate_mask_caps() and phylink_generic_validate(). Also, as phylink_generic_validate() is no longer called, we can remove its implementation as well. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qsPkK-009wip-W9@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Russell King (Oracle) authored
The MAC .validate() method is no longer used, so remove it from the phylink_mac_ops structure, and remove the callsite in phylink_validate_mac_and_pcs(). Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qsPkF-009wij-QM@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Russell King (Oracle) authored
Convert fman to use the .mac_get_caps() method rather than the .validate() method. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Sean Anderson <sean.anderson@seco.com> Link: https://lore.kernel.org/r/E1qsPkA-009wid-Kv@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Russell King (Oracle) authored
Provide a new method, mac_get_caps() to get the MAC capabilities for the specified interface mode. This is for MACs which have special requirements, such as not supporting half-duplex in certain interface modes, and will replace the validate() method. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qsPk5-009wiX-G5@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
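As a hedged illustration of what such a callback might look like, the capability set and RGMII restriction below are invented for the example rather than taken from a real driver:

    #include <linux/phy.h>
    #include <linux/phylink.h>

    /* Example-only .mac_get_caps() implementation: report the MAC's
     * capabilities for a given interface mode, dropping half duplex
     * in RGMII modes. */
    static unsigned long example_mac_get_caps(struct phylink_config *config,
                                              phy_interface_t interface)
    {
        unsigned long caps = MAC_10 | MAC_100 | MAC_1000FD;

        if (phy_interface_mode_is_rgmii(interface))
            caps &= ~(MAC_10HD | MAC_100HD);

        return caps;
    }

Compared with a .validate() method, the driver only deals with MAC_* capabilities and never has to touch ethtool linkmode masks itself.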
-
Jakub Kicinski authored
Recent FW interface update bumped the size of struct hwrm_func_cfg_input above 128B which is the max some devices support. Probe on Stratus (BCM957452) with FW 20.8.3.11 fails with:
bnxt_en ...: Unable to reserve tx rings
bnxt_en ...: 2nd rings reservation failed.
bnxt_en ...: Not enough rings available.
Once probe is fixed other errors pop up:
bnxt_en ...: Failed to set async event completion ring.
This is because __hwrm_send() rejects requests larger than bp->hwrm_max_ext_req_len with -E2BIG. Since the driver doesn't actually access any of the new fields, yet, trim the length. It should be safe. Similar workaround exists for backing_store_cfg_input. Although that one mins() to a constant of 256, not 128 we'll effectively use here. Michael explains: "the backing store cfg command is supported by relatively newer firmware that will accept 256 bytes at least." To make debugging easier in the future add a warning for oversized requests. Fixes: 754fbf60 ("bnxt_en: Update firmware interface to 1.10.2.171") Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20231016171640.1481493-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Johannes Nixdorf says: ==================== bridge: Add a limit on learned FDB entries Introduce a limit on the amount of learned FDB entries on a bridge, configured by netlink with a build time default on bridge creation in the kernel config. For backwards compatibility the kernel config default is disabling the limit (0). Without any limit a malicious actor may OOM a kernel by spamming packets with changing MAC addresses on their bridge port, so allow the bridge creator to limit the number of entries. Currently the manual entries are identified by the bridge flags BR_FDB_LOCAL or BR_FDB_ADDED_BY_USER, atomically bundled under the new flag BR_FDB_DYNAMIC_LEARNED. This means the limit also applies to entries created with BR_FDB_ADDED_BY_EXT_LEARN but none of BR_FDB_LOCAL or BR_FDB_ADDED_BY_USER, e.g. ones added by SWITCHDEV_FDB_ADD_TO_BRIDGE.
Link to the corresponding iproute2 changes: https://lore.kernel.org/r/20230919-fdb_limit-v4-1-b4d2dc4df30f@avm.de
v4: https://lore.kernel.org/r/20230919-fdb_limit-v4-0-39f0293807b8@avm.de/
v3: https://lore.kernel.org/r/20230905-fdb_limit-v3-0-7597cd500a82@avm.de/
v2: https://lore.kernel.org/netdev/20230619071444.14625-1-jnixdorf-oss@avm.de/
v1: https://lore.kernel.org/netdev/20230515085046.4457-1-jnixdorf-oss@avm.de/
====================
Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-0-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Johannes Nixdorf authored
Add a suite covering the fdb_n_learned and fdb_max_learned bridge features, touching all special cases in accounting at least once. Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de> Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-5-32cddff87758@avm.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Johannes Nixdorf authored
Set any new attributes added to br_policy to be parsed strictly, to prevent userspace from passing garbage. Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-4-32cddff87758@avm.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Johannes Nixdorf authored
The previous patch added accounting and a limit for the number of dynamically learned FDB entries per bridge. However it did not provide means to actually configure those bounds or read back the count. This patch does that. Two new netlink attributes are added for the accounting and limit of dynamically learned FDB entries:
- IFLA_BR_FDB_N_LEARNED (RO) for the number of entries accounted for a single bridge.
- IFLA_BR_FDB_MAX_LEARNED (RW) for the configured limit of entries for the bridge.
The new attributes are used like this:
 # ip link add name br up type bridge fdb_max_learned 256
 # ip link add name v1 up master br type veth peer v2
 # ip link set up dev v2
 # mausezahn -a rand -c 1024 v2
 0.01 seconds (90877 packets per second
 # bridge fdb | grep -v permanent | wc -l
 256
 # ip -d link show dev br
 13: br: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 [...]
    [...] fdb_n_learned 256 fdb_max_learned 256
Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-3-32cddff87758@avm.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Johannes Nixdorf authored
A malicious actor behind one bridge port may spam the kernel with packets with a random source MAC address, each of which will create an FDB entry, each of which is a dynamic allocation in the kernel. There are roughly 2^48 different MAC addresses, further limited by the rhashtable they are stored in to 2^31. Each entry is of the type struct net_bridge_fdb_entry, which is currently 128 bytes big. This means the maximum amount of memory allocated for FDB entries is 2^31 * 128B = 256GiB, which is too much for most computers. Mitigate this by maintaining a per bridge count of those automatically generated entries in fdb_n_learned, and a limit in fdb_max_learned. If the limit is hit new entries are not learned anymore. For backwards compatibility the default setting of 0 disables the limit. User-added entries by netlink or from bridge or bridge port addresses are never blocked and do not count towards that limit. Introduce a new fdb entry flag BR_FDB_DYNAMIC_LEARNED to keep track of whether an FDB entry is included in the count. The flag is enabled for dynamically learned entries, and disabled for all other entries. This should be equivalent to BR_FDB_ADDED_BY_USER and BR_FDB_LOCAL being unset, but contrary to the two flags it can be toggled atomically. Atomicity is required here, as there are multiple callers that modify the flags, but are not under a common lock (br_fdb_update is the exception for br->hash_lock, br_fdb_external_learn_add for RTNL). Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de> Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-2-32cddff87758@avm.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
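A simplified sketch of the accounting scheme described above; the structure, field and flag names mirror the description but are invented for the example, not taken from the bridge code:

    #include <linux/atomic.h>
    #include <linux/bitops.h>
    #include <linux/types.h>

    #define EXAMPLE_FDB_DYNAMIC_LEARNED 0   /* bit number in entry flags */

    struct example_bridge {
        atomic_t fdb_n_learned;
        u32 fdb_max_learned;                /* 0 means no limit */
    };

    struct example_fdb_entry {
        unsigned long flags;
    };

    static bool example_may_learn(struct example_bridge *br)
    {
        return !br->fdb_max_learned ||
               atomic_read(&br->fdb_n_learned) < br->fdb_max_learned;
    }

    static void example_mark_dynamic(struct example_bridge *br,
                                     struct example_fdb_entry *f)
    {
        /* test_and_set_bit() makes the flag flip atomic, so two callers
         * racing on the same entry cannot double-count it. */
        if (!test_and_set_bit(EXAMPLE_FDB_DYNAMIC_LEARNED, &f->flags))
            atomic_inc(&br->fdb_n_learned);
    }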
-
Johannes Nixdorf authored
In preparation for the following fdb limit for dynamically learned entries, allow fdb_create to detect that the entry was added by the user. This way it can skip applying the limit in this case. Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de> Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-1-32cddff87758@avm.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 17 Oct, 2023 11 commits
-
-
Jakub Kicinski authored
Merge tag 'wireless-next-2023-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Kalle Valo says: ==================== wireless-next patches for v6.7

The second pull request for v6.7, with only driver changes this time. We have now support for mt7925 PCIe and USB variants, few new features and of course some fixes.

Major changes:
mt76 - mt7925 support
ath12k - read board data variant name from SMBIOS
wfx - Remain-On-Channel (ROC) support

* tag 'wireless-next-2023-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (109 commits)
  wifi: rtw89: mac: do bf_monitor only if WiFi 6 chips
  wifi: rtw89: mac: set bf_assoc capabilities according to chip gen
  wifi: rtw89: mac: set bfee_ctrl() according to chip gen
  wifi: rtw89: mac: add registers of MU-EDCA parameters for WiFi 7 chips
  wifi: rtw89: mac: generalize register of MU-EDCA switch according to chip gen
  wifi: rtw89: mac: update RTS threshold according to chip gen
  wifi: rtlwifi: simplify TX command fill callbacks
  wifi: hostap: remove unused ioctl function
  wifi: atmel: remove unused ioctl function
  wifi: rtw89: coex: add annotation __counted_by() to struct rtw89_btc_btf_set_mon_reg
  wifi: rtw89: coex: add annotation __counted_by() for struct rtw89_btc_btf_set_slot_table
  wifi: rtw89: add EHT radiotap in monitor mode
  wifi: rtw89: show EHT rate in debugfs
  wifi: rtw89: parse TX EHT rate selected by firmware from RA C2H report
  wifi: rtw89: Add EHT rate mask as parameters of RA H2C command
  wifi: rtw89: parse EHT information from RX descriptor and PPDU status packet
  wifi: radiotap: add bandwidth definition of EHT U-SIG
  wifi: rtlwifi: use convenient list_count_nodes()
  wifi: p54: Annotate struct p54_cal_database with __counted_by
  wifi: brcmfmac: fweh: Add __counted_by for struct brcmf_fweh_queue_item and use struct_size()
  ...
====================

Link: https://lore.kernel.org/r/20231016143822.880D8C433C8@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Christophe JAILLET authored
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/ca5c8049f58bb933f231afd0816e30a5aaa0eddd.1697264974.git.christophe.jaillet@wanadoo.fr Signed-off-by: Paolo Abeni <pabeni@redhat.com>
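A minimal sketch of what the annotation looks like, on an invented structure rather than the one touched by this commit:

    #include <linux/compiler_attributes.h>
    #include <linux/types.h>

    struct example_table {
        u32 count;
        /* Tells the compiler (and the fortified string/mem helpers) that
         * the flexible array holds 'count' elements. */
        u32 ids[] __counted_by(count);
    };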
-
Christophe JAILLET authored
Use struct_size() instead of hand writing it. This is less verbose and more robust. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/e5122b4ff878cbf3ed72653a395ad5c4da04dc1e.1697264974.git.christophe.jaillet@wanadoo.fr Signed-off-by: Paolo Abeni <pabeni@redhat.com>
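For illustration, a hedged example of the pattern with invented types and names: struct_size() computes sizeof(*q) + n * sizeof(q->items[0]) with overflow checking, which is what the open-coded arithmetic did by hand:

    #include <linux/overflow.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    struct example_queue {
        unsigned int n_items;
        u64 items[];
    };

    static struct example_queue *example_alloc_queue(unsigned int n, gfp_t gfp)
    {
        struct example_queue *q = kzalloc(struct_size(q, items, n), gfp);

        if (q)
            q->n_items = n;
        return q;
    }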
-
Paolo Abeni authored
Matt Johnston says: ==================== I3C MCTP net driver This series adds an I3C transport for the kernel's MCTP network protocol. MCTP is a communication protocol between system components (BMCs, drives, NICs etc), with higher level protocols such as NVMe-MI or PLDM built on top of it (in userspace). It runs over various transports such as I2C, PCIe, or I3C. The mctp-i3c driver follows a similar approach to the kernel's existing mctp-i2c driver, creating a "mctpi3cX" network interface for each numbered I3C bus. Busses opt in to support by adding a "mctp-controller" property to the devicetree: &i3c0 { mctp-controller; } The driver will bind to MCTP class devices (DCR 0xCC) that are on a supported I3C bus. Each bus is represented by a `struct mctp_i3c_bus` that keeps state for the network device. An individual I3C device (struct mctp_i3c_device) performs operations using the "parent" mctp_i3c_bus object. The I3C notify/enumeration patch is needed so that the mctp-i3c driver can handle creating/removing mctp_i3c_bus objects as required. The mctp-i3c driver is using the Provisioned ID as an identifier for target I3C devices (the neighbour address), as that will be more stable than the I3C dynamic address. The driver internally translates that to a dynamic address for bus operations. The driver has been tested using an AST2600 platform. A remote endpoint has been tested against QEMU, as well as using the target mode support in Aspeed's vendor tree. I3C maintainers have acked merging this through net-next tree. ==================== Link: https://lore.kernel.org/r/20231013040628.354323-1-matt@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Matt Johnston authored
Provides MCTP network transport over an I3C bus, as specified in DMTF DSP0233. Each I3C bus (with "mctp-controller" devicetree property) gets an "mctpi3cX" net device created. I3C devices are reachable as remote endpoints through that net device. Link layer addressing uses the I3C PID as a fixed hardware address for neighbour table entries. The driver matches I3C devices that have the MIPI assigned DCR 0xCC for MCTP. Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Jeremy Kerr authored
This allows other drivers to be notified when new i3c busses are attached, referring to a whole i3c bus as opposed to individual devices. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Matt Johnston authored
This property is used to describe a I3C bus with attached MCTP I3C target devices. Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Florian Westphal authored
consume_skb() doesn't walk the segment list, so segments other than the first are leaked. Move this consume_skb() call into the loop. Cc: Willem de Bruijn <willemb@google.com> Fixes: b3098d32 ("net: add skb_segment kunit test") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
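A rough sketch of the shape of such a fix, with an invented function name and a placeholder for the per-segment checks:

    #include <linux/skbuff.h>

    static void example_check_and_free_segs(struct sk_buff *segs)
    {
        struct sk_buff *skb = segs;

        while (skb) {
            struct sk_buff *next = skb->next;

            /* ... per-segment assertions would go here ... */

            /* Freeing inside the loop is what avoids the leak:
             * consume_skb() frees only this one skb and does not
             * walk the ->next list. */
            consume_skb(skb);
            skb = next;
        }
    }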
-
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Jakub Kicinski authored
Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-16

We've added 90 non-merge commits during the last 25 day(s) which contain a total of 120 files changed, 3519 insertions(+), 895 deletions(-).

The main changes are:
1) Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs, from Jiri Olsa.
2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is for systemd to reimplement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services, from Daan De Meyer.
3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich.
4) Improve BPF verifier log output for scalar registers to better disambiguate their internal state wrt defaults vs min/max values matching, from Andrii Nakryiko.
5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving the source IP address with a new BPF_FIB_LOOKUP_SRC flag, from Martynas Pumputis.
6) Add support for open-coded task_vma iterator to help with symbolization for BPF-collected user stacks, from Dave Marchevsky.
7) Add libbpf getters for accessing individual BPF ring buffers which is useful for polling them individually, for example, from Martin Kelly.
8) Extend AF_XDP selftests to validate the SHARED_UMEM feature, from Tushar Vyavahare.
9) Improve BPF selftests cross-building support for riscv arch, from Björn Töpel.
10) Add the ability to pin a BPF timer to the same calling CPU, from David Vernet.
11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments, from Alexandre Ghiti.
12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen.
13) Fix bpftool's skeleton code generation to guarantee that ELF data is 8 byte aligned, from Ian Rogers.
14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4 security mitigations in BPF verifier, from Yafang Shao.
15) Annotate struct bpf_stack_map with __counted_by attribute to prepare BPF side for upcoming __counted_by compiler support, from Kees Cook.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits)
  bpf: Ensure proper register state printing for cond jumps
  bpf: Disambiguate SCALAR register state output in verifier logs
  selftests/bpf: Make align selftests more robust
  selftests/bpf: Improve missed_kprobe_recursion test robustness
  selftests/bpf: Improve percpu_alloc test robustness
  selftests/bpf: Add tests for open-coded task_vma iter
  bpf: Introduce task_vma open-coded iterator kfuncs
  selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c
  bpf: Don't explicitly emit BTF for struct btf_iter_num
  bpf: Change syscall_nr type to int in struct syscall_tp_t
  net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set
  bpf: Avoid unnecessary audit log for CPU security mitigations
  selftests/bpf: Add tests for cgroup unix socket address hooks
  selftests/bpf: Make sure mount directory exists
  documentation/bpf: Document cgroup unix socket address hooks
  bpftool: Add support for cgroup unix socket address hooks
  libbpf: Add support for cgroup unix socket address hooks
  bpf: Implement cgroup sockaddr hooks for unix sockets
  bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
  bpf: Propagate modified uaddrlen from cgroup sockaddr programs
  ...
====================

Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Yunsheng Lin authored
Currently page_pool_alloc_frag() is not supported in 32-bit arch with 64-bit DMA because of the overlap issue between pp_frag_count and dma_addr_upper in 'struct page' for those arches, which seems to be quite common, see [1], which means driver may need to handle it when using fragment API. It is assumed that the combination of the above arch with an address space >16TB does not exist, as all those arches have 64b equivalent, it seems logical to use the 64b version for a system with a large address space. It is also assumed that dma address is page aligned when we are dma mapping a page aligned buffer, see [2]. That means we're storing 12 bits of 0 at the lower end for a dma address, we can reuse those bits for the above arches to support 32b+12b, which is 16TB of memory. If we make a wrong assumption, a warning is emitted so that user can report to us.
1. https://lore.kernel.org/all/20211117075652.58299-1-linyunsheng@huawei.com/
2. https://lore.kernel.org/all/20230818145145.4b357c89@kernel.org/
Tested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Lorenzo Bianconi <lorenzo@kernel.org> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Liang Chen <liangchen.linux@gmail.com> CC: Guillaume Tucker <guillaume.tucker@collabora.com> CC: Matthew Wilcox <willy@infradead.org> CC: Linux-MM <linux-mm@kvack.org> Link: https://lore.kernel.org/r/20231013064827.61135-2-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
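A rough sketch of the trick being described, with invented names and without the page_pool plumbing; the point is only that a page-aligned DMA address has its low PAGE_SHIFT bits free, so the shifted value fits in an unsigned long on the affected 32-bit configurations (covering up to 32+12 bits of address space):

    #include <linux/mm.h>
    #include <linux/types.h>

    /* Store a page-aligned dma address shifted right by PAGE_SHIFT into an
     * unsigned long slot; warn (so users can report it) if the assumptions
     * above do not hold. */
    static bool example_pack_dma_addr(unsigned long *slot, dma_addr_t addr)
    {
        dma_addr_t shifted = addr >> PAGE_SHIFT;

        if (WARN_ON_ONCE(addr & ~PAGE_MASK))
            return false;                   /* not page aligned */
        if (WARN_ON_ONCE(shifted != (unsigned long)shifted))
            return false;                   /* address space too large */

        *slot = shifted;
        return true;
    }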
-
Jacob Keller authored
A few networking drivers including bnx2x, bnxt, qede, and idpf call tcp_gro_complete as part of offloading TCP GRO. The function is only defined if CONFIG_INET is true, since its TCP specific and is meaningless if the kernel lacks IP networking support. The combination of trying to use the complex network drivers with CONFIG_NET but not CONFIG_INET is rather unlikely in practice: most use cases are going to need IP networking. The tcp_gro_complete function just sets some data in the socket buffer for use in processing the TCP packet in the event that the GRO was offloaded to the device. If the kernel lacks TCP support, such setup will simply go unused. The bnx2x, bnxt, and qede drivers wrap their TCP offload support in CONFIG_INET checks and skip handling on such kernels. The idpf driver did not check CONFIG_INET and thus fails to link if the kernel is configured with CONFIG_NET=y, CONFIG_IDPF=(m|y), and CONFIG_INET=n. While checking CONFIG_INET does allow the driver to bypass significantly more instructions in the event that we know TCP networking isn't supported, the configuration is unlikely to be used widely. Rather than require driver authors to care about this, stub the tcp_gro_complete function when CONFIG_INET=n. This allows drivers to be left as-is. It does mean the idpf driver will perform slightly more work than strictly necessary when CONFIG_INET=n, since it will still execute some of the skb setup in idpf_rx_rsc. However, that work would be performed in the case where CONFIG_INET=y anyways. I did not change the existing drivers, since they appear to wrap a significant portion of code when CONFIG_INET=n. There is little benefit in trashing these drivers just to unwrap and remove the CONFIG_INET check. Using a stub for tcp_gro_complete is still beneficial, as it means future drivers no longer need to worry about this case of CONFIG_NET=y and CONFIG_INET=n, which should reduce noise from buildbots that check such a configuration. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Link: https://lore.kernel.org/r/20231013185502.1473541-1-jacob.e.keller@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
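The general shape of such a stub, shown as a hedged sketch of the pattern rather than the exact hunk applied to the TCP headers:

    #include <linux/skbuff.h>

    #if IS_ENABLED(CONFIG_INET)
    void tcp_gro_complete(struct sk_buff *skb);
    #else
    static inline void tcp_gro_complete(struct sk_buff *skb) { }
    #endif

With the inline no-op in place, a driver such as idpf can call tcp_gro_complete() unconditionally and still link when CONFIG_INET=n.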
-
- 16 Oct, 2023 1 commit
-
-
Muhammad Muzammil authored
Resolved a typing mistake: 'devce' to 'device'. Signed-off-by: Muhammad Muzammil <m.muzzammilashraf@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231013042304.7881-1-m.muzzammilashraf@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-