Merge branch 'net-netconsole-fix-netconsole-unsafe-locking'
Breno Leitao says:

====================
net: netconsole: Fix netconsole unsafe locking

Problem:
=======

The current locking mechanism in netconsole is unsafe and suboptimal due
to the following issues:

1) Lock Release and Reacquisition Mid-Loop:

In netconsole_netdev_event(), the target_list_lock is released and
reacquired within a loop, potentially causing collisions and cleaning up
targets that are being enabled.

	int netconsole_netdev_event()
	{
	...
		spin_lock_irqsave(&target_list_lock, flags);
		list_for_each_entry(nt, &target_list, list) {
			spin_unlock_irqrestore(&target_list_lock, flags);
			__netpoll_cleanup(&nt->np);
			spin_lock_irqsave(&target_list_lock, flags);
		}
		spin_unlock_irqrestore(&target_list_lock, flags);
	...
	}

2) Non-Atomic Cleanup Operations:

In enabled_store(), the cleanup of structures is not atomic, risking
cleanup of structures that are in the process of being enabled.

	ssize_t enabled_store()
	{
	...
		spin_lock_irqsave(&target_list_lock, flags);
		nt->enabled = false;
		spin_unlock_irqrestore(&target_list_lock, flags);
		netpoll_cleanup(&nt->np);
	...
	}

These issues stem from the following limitations in netconsole's locking
design:

1) write_{ext_}msg() functions:
   a) Cannot sleep.
   b) Must iterate through targets and send messages to all enabled
      entries.
   c) List iteration is protected by the target_list_lock spinlock.

2) Network event handling in netconsole_netdev_event():
   a) Needs to sleep.
   b) Requires iteration over the target list (holding the
      target_list_lock spinlock).
   c) Some events necessitate netpoll struct cleanup, which *needs* to
      sleep.

The target_list_lock needs to be usable by non-sleepable functions while
also protecting operations that may sleep, leading to the current unsafe
design.

Solution:
========

1) Dual Locking Mechanism:
   - Retain the current target_list_lock for non-sleepable use cases.
   - Introduce target_cleanup_list_lock (a mutex) for sleepable
     operations.

2) Deferred Cleanup:
   - Implement atomic, deferred cleanup of structures using the new
     mutex (target_cleanup_list_lock).
   - Avoid the `goto` in the middle of the list_for_each_entry() loop.

3) Separate Cleanup List:
   - Create target_cleanup_list for deferred cleanup, protected by
     target_cleanup_list_lock.
   - This allows cleanup() to sleep without affecting message
     transmission.
   - When iterating over targets, move devices needing cleanup to
     target_cleanup_list.
   - Handle cleanup under the target_cleanup_list_lock mutex.

4) Establish a clear locking hierarchy:
   - target_cleanup_list_lock takes precedence over target_list_lock.
   - Major workflow locking sequences:
     a) Network event affecting netpoll (netconsole_netdev_event()):
        rtnl -> target_cleanup_list_lock -> target_list_lock
     b) Message writing (write_msg()):
        console_lock -> target_list_lock
     c) Configfs target enable/disable (enabled_store()):
        dynamic_netconsole_mutex -> target_cleanup_list_lock ->
        target_list_lock

This hierarchy ensures a consistent lock acquisition order across
different operations, preventing deadlocks and maintaining proper
synchronization. The target_cleanup_list_lock's higher precedence allows
safe deferred cleanup without interfering with regular message
transmission, which is protected by target_list_lock. Each workflow
follows a specific locking sequence, ensuring that operations such as
network event handling, message writing, and target management are
properly synchronized and do not conflict with one another.
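To make the new scheme concrete, below is a minimal sketch of the
deferred-cleanup path under the locking hierarchy described above. It is
illustrative only, not the merged implementation: the do_netpoll_cleanup()
helper and the netconsole_defer_cleanup() name are assumptions made for
the example.

	/* Sketch only: deferred netpoll cleanup, called under rtnl.
	 * Lock order: target_cleanup_list_lock -> target_list_lock.
	 */
	static LIST_HEAD(target_cleanup_list);
	static DEFINE_MUTEX(target_cleanup_list_lock);

	/* Disable targets bound to @dev and move them off target_list so
	 * that write_msg() never sees a half-torn-down entry. No sleeping
	 * happens while the spinlock is held.
	 */
	static void netconsole_defer_cleanup(struct net_device *dev)
	{
		struct netconsole_target *nt, *tmp;
		unsigned long flags;

		mutex_lock(&target_cleanup_list_lock);
		spin_lock_irqsave(&target_list_lock, flags);
		list_for_each_entry_safe(nt, tmp, &target_list, list) {
			if (nt->np.dev != dev)
				continue;
			nt->enabled = false;
			list_move(&nt->list, &target_cleanup_list);
		}
		spin_unlock_irqrestore(&target_list_lock, flags);
		mutex_unlock(&target_cleanup_list_lock);
	}

	/* Later, outside target_list_lock, the deferred entries are
	 * cleaned up in a context that is allowed to sleep, then moved
	 * back to target_list as disabled targets.
	 */
	static void netconsole_process_cleanups(void)
	{
		struct netconsole_target *nt, *tmp;
		unsigned long flags;

		mutex_lock(&target_cleanup_list_lock);
		list_for_each_entry_safe(nt, tmp, &target_cleanup_list, list) {
			do_netpoll_cleanup(&nt->np);	/* may sleep */
			spin_lock_irqsave(&target_list_lock, flags);
			list_move(&nt->list, &target_list);
			spin_unlock_irqrestore(&target_list_lock, flags);
		}
		mutex_unlock(&target_cleanup_list_lock);
	}

The point of the split is that the sleepable work happens only under the
mutex, so write_msg() can keep taking just the spinlock in atomic
context, while the netdev event handler takes both locks in the fixed
order above.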
Changelog:

v3:
 * Move the netconsole_process_cleanups() function inside the
   CONFIG_NETCONSOLE_DYNAMIC block, avoiding -Werror=unused-function
   (Jakub)

v2:
 * The selftest has been removed from the patchset because veth is now
   IFF_DISABLE_NETPOLL. A new test will be sent separately.
 * https://lore.kernel.org/all/20240807091657.4191542-1-leitao@debian.org/

v1:
 * https://lore.kernel.org/all/20240801161213.2707132-1-leitao@debian.org/
====================

Link: https://patch.msgid.link/20240808122518.498166-1-leitao@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>