Merge branch 'fix-rtnl_mutex-deadlock-with-dpaa2-and-sfp-modules'
Vladimir Oltean says: ==================== Fix rtnl_mutex deadlock with DPAA2 and SFP modules This patch set deliberately targets net-next and lacks Fixes: tags due to caution on my part. While testing some SFP modules on the Solidrun Honeycomb LX2K platform, I noticed that rebooting causes a deadlock: ============================================ WARNING: possible recursive locking detected 6.1.0-rc5-07010-ga9b9500ffaac-dirty #656 Not tainted -------------------------------------------- systemd-shutdow/1 is trying to acquire lock: ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30 but task is already holding lock: ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(rtnl_mutex); lock(rtnl_mutex); *** DEADLOCK *** May be due to missing lock nesting notation 6 locks held by systemd-shutdow/1: #0: ffffa62db6863c70 (system_transition_mutex){+.+.}-{4:4}, at: __do_sys_reboot+0xd4/0x260 #1: ffff2f2b0176f100 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xf4/0x260 #2: ffff2f2b017be900 (&dev->mutex){....}-{4:4}, at: device_shutdown+0x104/0x260 #3: ffff2f2b017680f0 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x40/0x260 #4: ffff2f2b0e1608f0 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x40/0x260 #5: ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30 stack backtrace: CPU: 6 PID: 1 Comm: systemd-shutdow Not tainted 6.1.0-rc5-07010-ga9b9500ffaac-dirty #656 Hardware name: SolidRun LX2160A Honeycomb (DT) Call trace: lock_acquire+0x68/0x84 __mutex_lock+0x98/0x460 mutex_lock_nested+0x2c/0x40 rtnl_lock+0x1c/0x30 sfp_bus_del_upstream+0x1c/0xac phylink_destroy+0x1c/0x50 dpaa2_mac_disconnect+0x28/0x70 dpaa2_eth_remove+0x1dc/0x1f0 fsl_mc_driver_remove+0x24/0x60 device_remove+0x70/0x80 device_release_driver_internal+0x1f0/0x260 device_links_unbind_consumers+0xe0/0x110 device_release_driver_internal+0x138/0x260 device_release_driver+0x18/0x24 bus_remove_device+0x12c/0x13c device_del+0x16c/0x424 fsl_mc_device_remove+0x28/0x40 __fsl_mc_device_remove+0x10/0x20 device_for_each_child+0x5c/0xac dprc_remove+0x94/0xb4 fsl_mc_driver_remove+0x24/0x60 device_remove+0x70/0x80 device_release_driver_internal+0x1f0/0x260 device_release_driver+0x18/0x24 bus_remove_device+0x12c/0x13c device_del+0x16c/0x424 fsl_mc_bus_remove+0x8c/0x10c fsl_mc_bus_shutdown+0x10/0x20 platform_shutdown+0x24/0x3c device_shutdown+0x15c/0x260 kernel_restart+0x40/0xa4 __do_sys_reboot+0x1e4/0x260 __arm64_sys_reboot+0x24/0x30 But fixing this appears to be not so simple. The patch set represents my attempt to address it. In short, the problem is that dpaa2_mac_connect() and dpaa2_mac_disconnect() call 2 phylink functions in a row, one takes rtnl_lock() itself - phylink_create(), and one which requires rtnl_lock() to be held by the caller - phylink_fwnode_phy_connect(). The existing approach in the drivers is too simple. We take rtnl_lock() when calling dpaa2_mac_connect(), which is what results in the deadlock. Fixing just that creates another problem. The drivers make use of rtnl_lock() for serializing with other code paths too. I think I've found all those code paths, and established other mechanisms for serializing with them. ==================== Link: https://lore.kernel.org/r/20221129141221.872653-1-vladimir.oltean@nxp.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>
Showing
Please register or sign in to comment