    ocfs2/dlm: disallow node joining when recovery is on going · 01c6222f
    Xue jiufei authored
    We found a race between dlm recovery and node joining that can occur
    when both happen at the same time and the network is unreliable.
    
    N1                                      N4
    
    start joining dlm and send
    query join to all live nodes
                                set joining node to N1, return OK
    send query join to other
    live nodes and it may take
    a while
    
    call dlm_send_join_assert()
    to send assert join message;
    N2 is down by now, so keep
    retrying the send to N2
    until N2 is found to be down
    
    send assert join message to
    N3, but the connection to
    N3 is down, so it may take
    a while
                                become the recovery master for N2
                                and send begin reco message to the
                                other nodes in domain map, which
                                does not include N1
    connection with N3 is rebuilt,
    then send assert join to N4
                                call dlm_assert_joined_handler(),
                                add N1 to domain_map
    
                                dlm recovery done, send finalize message
                                to nodes in domain map, including N1
    receive finalize message,
    trigger BUG() because the
    recovery master does not match
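
The rule named in the subject line can be sketched in userspace C as follows. This is a minimal illustration under stated assumptions, not the kernel code: `struct domain`, `handle_query_join()`, `NO_NODE` and the two response codes are hypothetical stand-ins for the real dlm context and query-join handler. It shows only the guard "refuse a join query while recovery is active" and does not model a join query that was already answered before recovery began, as happens in the timeline above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; not the kernel's dlm structures. */
#define NO_NODE (-1)

enum join_response { JOIN_OK, JOIN_DISALLOW };

struct domain {
    int  joining_node;     /* node currently joining, or NO_NODE */
    bool recovery_active;  /* true while this node is running recovery */
};

/* Decide how to answer a query-join request from `node`. */
enum join_response handle_query_join(struct domain *d, int node)
{
    if (d->joining_node != NO_NODE)
        return JOIN_DISALLOW;  /* another join is already in flight */
    if (d->recovery_active)
        return JOIN_DISALLOW;  /* recovery is running: joiner must retry */
    d->joining_node = node;    /* remember who is joining */
    return JOIN_OK;
}

/* Replay a join attempt made during recovery, then a retry afterwards. */
void replay_join_during_recovery(void)
{
    struct domain n4 = { .joining_node = NO_NODE, .recovery_active = true };

    /* N1 asks to join while N4 is recovering: refused, state unchanged. */
    assert(handle_query_join(&n4, 1) == JOIN_DISALLOW);
    assert(n4.joining_node == NO_NODE);

    /* Recovery finishes; the retried query is now accepted. */
    n4.recovery_active = false;
    assert(handle_query_join(&n4, 1) == JOIN_OK);
    assert(n4.joining_node == 1);
}
```

With a guard of this kind the joining node simply retries its query later, so it is never added to the domain map in the middle of a recovery it did not take part in.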

    Signed-off-by: joyce.xue <xuejiufei@huawei.com>
    Cc: Mark Fasheh <mfasheh@suse.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dlmdomain.c 60.5 KB