    [DLM] fix master recovery · 222d3960
    David Teigland authored
    If master recovery happens on an rsb in one recovery sequence and that
    sequence is aborted before lock recovery happens, then in the next
    sequence we rely on the previous master recovery (which may now be
    invalid due to another node ignoring a lookup result) and go on to do the
    lock recovery, where we get stuck due to an invalid master value.
    
     recovery cycle begins: master of rsb X has left
     nodes A and B send node C an rcom lookup for X to find the new master
     C gets lookup from B first, sets B as new master, and sends reply back to B
     C gets lookup from A next, and sends reply back to A saying B is master
     A gets lookup reply from C and sets B as the new master in the rsb
     recovery cycle on A, B and C is aborted to start a new recovery
     B gets lookup reply from C and ignores it since there's a new recovery
     recovery cycle begins: some other node has joined
     B doesn't think it's the master of X so it doesn't rebuild it in the directory
     C looks up the master of X, no one is master, so it becomes new master
     B looks up the master of X, finds it's C
     A believes that B is the master of X, so it sends its lock to B
     B sends an error back to A
     A resends
     this repeats forever, the incorrect master value on A is never corrected
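
The resend loop at the end of this sequence can be modelled with a minimal sketch. It is not taken from the DLM source: `node_view`, `receiver_accepts` and `resend_until_accepted` are hypothetical names, node ids are arbitrary integers, and the retry bound exists only so the model terminates.

```c
#include <stdbool.h>

/* Hypothetical model of one node's view of rsb X (not a DLM structure). */
struct node_view {
    int believed_master;   /* nodeid this node thinks is master of X */
};

/* A node accepts a recovered lock only if it really is the master. */
static bool receiver_accepts(int receiver, int actual_master)
{
    return receiver == actual_master;
}

/*
 * Returns the number of sends needed before the lock is accepted,
 * or -1 if max_tries is exhausted. The error reply never updates
 * sender->believed_master, so a wrong value fails on every attempt.
 */
int resend_until_accepted(const struct node_view *sender,
                          int actual_master, int max_tries)
{
    for (int tries = 1; tries <= max_tries; tries++) {
        if (receiver_accepts(sender->believed_master, actual_master))
            return tries;
        /* error comes back, sender resends to the same target */
    }
    return -1;
}
```

With A believing B is master while C actually is, the function returns -1 for any bound, which corresponds to the endless resend described above.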
    
    The fix is to redo master recovery on any rsb that still has the
    NEW_MASTER flag set from an earlier recovery sequence, since a flag
    that is still set means lock recovery never completed for that rsb.
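
As a rough illustration of that rule, the sketch below models the decision of whether an rsb needs master recovery. It is a simplified stand-in and not the code in recover.c: `rsb_model`, `RSB_NEW_MASTER_BIT`, `node_is_member` and `needs_master_recovery` are hypothetical names, and it assumes that the only other trigger for master recovery is the current master having left.

```c
#include <stdbool.h>

/* Hypothetical flag bit standing in for the NEW_MASTER rsb flag. */
enum { RSB_NEW_MASTER_BIT = 1u << 0 };

/* Hypothetical, simplified rsb: just the master nodeid and flags. */
struct rsb_model {
    int master_nodeid;
    unsigned flags;
};

/* True if nodeid is in the current membership list. */
static bool node_is_member(const int *members, int count, int nodeid)
{
    for (int i = 0; i < count; i++)
        if (members[i] == nodeid)
            return true;
    return false;
}

/*
 * Redo master recovery if the master has left, or if NEW_MASTER is
 * still set from an earlier sequence that was aborted before lock
 * recovery finished (the recorded master may no longer be valid).
 */
bool needs_master_recovery(const struct rsb_model *r,
                           const int *members, int count)
{
    if (r->flags & RSB_NEW_MASTER_BIT)
        return true;
    return !node_is_member(members, count, r->master_nodeid);
}
```

In the scenario above, node A's copy of X would still carry the flag in the second sequence, so A would look up the master again and find C instead of trusting its stale value B.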
    Signed-off-by: David Teigland <teigland@redhat.com>
    Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>