• Jon Maloy's avatar
    tipc: fix lockdep warning during node delete · ec835f89
    Jon Maloy authored
    We see the following lockdep warning:
    
    [ 2284.078521] ======================================================
    [ 2284.078604] WARNING: possible circular locking dependency detected
    [ 2284.078604] 4.19.0+ #42 Tainted: G            E
    [ 2284.078604] ------------------------------------------------------
    [ 2284.078604] rmmod/254 is trying to acquire lock:
    [ 2284.078604] 00000000acd94e28 ((&n->timer)#2){+.-.}, at: del_timer_sync+0x5/0xa0
    [ 2284.078604]
    [ 2284.078604] but task is already holding lock:
    [ 2284.078604] 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x190 [tipc]
    [ 2284.078604]
    [ 2284.078604] which lock already depends on the new lock.
    [ 2284.078604]
    [ 2284.078604]
    [ 2284.078604] the existing dependency chain (in reverse order) is:
    [ 2284.078604]
    [ 2284.078604] -> #1 (&(&tn->node_list_lock)->rlock){+.-.}:
    [ 2284.078604]        tipc_node_timeout+0x20a/0x330 [tipc]
    [ 2284.078604]        call_timer_fn+0xa1/0x280
    [ 2284.078604]        run_timer_softirq+0x1f2/0x4d0
    [ 2284.078604]        __do_softirq+0xfc/0x413
    [ 2284.078604]        irq_exit+0xb5/0xc0
    [ 2284.078604]        smp_apic_timer_interrupt+0xac/0x210
    [ 2284.078604]        apic_timer_interrupt+0xf/0x20
    [ 2284.078604]        default_idle+0x1c/0x140
    [ 2284.078604]        do_idle+0x1bc/0x280
    [ 2284.078604]        cpu_startup_entry+0x19/0x20
    [ 2284.078604]        start_secondary+0x187/0x1c0
    [ 2284.078604]        secondary_startup_64+0xa4/0xb0
    [ 2284.078604]
    [ 2284.078604] -> #0 ((&n->timer)#2){+.-.}:
    [ 2284.078604]        del_timer_sync+0x34/0xa0
    [ 2284.078604]        tipc_node_delete+0x1a/0x40 [tipc]
    [ 2284.078604]        tipc_node_stop+0xcb/0x190 [tipc]
    [ 2284.078604]        tipc_net_stop+0x154/0x170 [tipc]
    [ 2284.078604]        tipc_exit_net+0x16/0x30 [tipc]
    [ 2284.078604]        ops_exit_list.isra.8+0x36/0x70
    [ 2284.078604]        unregister_pernet_operations+0x87/0xd0
    [ 2284.078604]        unregister_pernet_subsys+0x1d/0x30
    [ 2284.078604]        tipc_exit+0x11/0x6f2 [tipc]
    [ 2284.078604]        __x64_sys_delete_module+0x1df/0x240
    [ 2284.078604]        do_syscall_64+0x66/0x460
    [ 2284.078604]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 2284.078604]
    [ 2284.078604] other info that might help us debug this:
    [ 2284.078604]
    [ 2284.078604]  Possible unsafe locking scenario:
    [ 2284.078604]
    [ 2284.078604]        CPU0                    CPU1
    [ 2284.078604]        ----                    ----
    [ 2284.078604]   lock(&(&tn->node_list_lock)->rlock);
    [ 2284.078604]                                lock((&n->timer)#2);
    [ 2284.078604]                                lock(&(&tn->node_list_lock)->rlock);
    [ 2284.078604]   lock((&n->timer)#2);
    [ 2284.078604]
    [ 2284.078604]  *** DEADLOCK ***
    [ 2284.078604]
    [ 2284.078604] 3 locks held by rmmod/254:
    [ 2284.078604]  #0: 000000003368be9b (pernet_ops_rwsem){+.+.}, at: unregister_pernet_subsys+0x15/0x30
    [ 2284.078604]  #1: 0000000046ed9c86 (rtnl_mutex){+.+.}, at: tipc_net_stop+0x144/0x170 [tipc]
    [ 2284.078604]  #2: 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x19
    [...}
    
    The reason is that the node timer handler sometimes needs to delete a
    node which has been disconnected for too long. To do this, it grabs
    the lock 'node_list_lock', which may at the same time be held by the
    generic node cleanup function, tipc_node_stop(), during module removal.
    Since the latter is calling del_timer_sync() inside the same lock, we
    have a potential deadlock.
    
    We fix this letting the timer cleanup function use spin_trylock()
    instead of just spin_lock(), and when it fails to grab the lock it
    just returns so that the timer handler can terminate its execution.
    This is safe to do, since tipc_node_stop() anyway is about to
    delete both the timer and the node instance.
    
    Fixes: 6a939f36 ("tipc: Auto removal of peer down node instance")
    Acked-by: default avatarYing Xue <ying.xue@windriver.com>
    Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    ec835f89
node.c 59.2 KB