    tipc: fix issues with early FAILOVER_MSG from peer · d0f84d08
    Tuong Lien authored
    It appears that a FAILOVER_MSG can arrive from the peer even while the
    failed link is still resetting (i.e. just after the
    'node_write_unlock()'...). This means the failover procedure has not
    yet been started on the node.
    The situation is as follows:
    
             node1                                node2
      linkb          linka                  linka        linkb
        |              |                      |            |
        |              |                      x failure    |
        |              |                  RESETTING        |
        |              |                      |            |
        |              x failure            RESET          |
        |          RESETTING             FAILINGOVER       |
        |              |   (FAILOVER_MSG)     |            |
        |<-------------------------------------------------|
        | *FAILINGOVER |                      |            |
        |              | (dummy FAILOVER_MSG) |            |
        |------------------------------------------------->|
        |            RESET                    |            | FAILOVER_END
        |         FAILINGOVER               RESET          |
        .              .                      .            .
        .              .                      .            .
        .              .                      .            .
    
    Once this happens, the link failover procedure is triggered wrongly on
    the receiving node because the node is not yet in FAILINGOVER state,
    and another link failover is then carried out afterwards.
    The consequences are:
    
    1) A peer might get stuck in FAILINGOVER state because the 'sync_point'
    was set, reset and then set again incorrectly. The criteria to end the
    failover would then never be met, and the peer could keep waiting for a
    message that has already been received.
    
    2) The early FAILOVER_MSG(s) could be queued in the link failover
    deferdq but would be purged or not pulled out because the 'drop_point'
    was not set correctly.
    
    3) The early FAILOVER_MSG(s) could be dropped too.
    
    4) The dummy FAILOVER_MSG could make the peer leave FAILINGOVER state
    early, only for the failover to be restarted later on.
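
    Consequence 1) can be illustrated with a minimal model. This is only
    a sketch: 'struct node_model', 'seq_more()' and 'failover_done()' are
    hypothetical names, and the end criterion (the next expected sequence
    number has passed 'sync_point') is an assumption made for
    illustration, not a copy of the TIPC code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "a is later than b" for 16-bit sequence numbers. */
static bool seq_more(uint16_t a, uint16_t b)
{
	uint16_t diff = (uint16_t)(a - b);

	return diff != 0 && diff < 0x8000;
}

/* Hypothetical, simplified view of the node's failover bookkeeping. */
struct node_model {
	bool failingover;
	uint16_t sync_point;	/* last tunnelled packet to wait for */
};

/* Assumed criterion: failover ends once rcv_nxt has passed sync_point. */
static bool failover_done(const struct node_model *n, uint16_t rcv_nxt)
{
	return n->failingover && seq_more(rcv_nxt, n->sync_point);
}
```

    In this model, if 'sync_point' is set again to a later value after the
    last tunnelled packet has already been received, 'failover_done()'
    never returns true and the node stays in FAILINGOVER state.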
    
    The same situation can also happen when the link is in PEER_RESET state
    and a FAILOVER_MSG arrives.
    
    The commit resolves the issues by forcing the link down immediately, so
    the failover procedure will be started normally (which is the same as
    when receiving a FAILOVER_MSG and the link is in up state).
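
    The intended handling can be sketched as a small state machine. This
    is a simplified illustration, not the actual patch: 'struct
    link_model', 'force_link_down()', 'start_failover()' and
    'on_failover_msg()' are hypothetical names, and the states only
    mirror those named in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified link states named after the commit text. */
enum link_state {
	LINK_UP,
	LINK_RESETTING,
	LINK_PEER_RESET,
	LINK_RESET,
	LINK_FAILINGOVER,
};

struct link_model {
	enum link_state state;
	uint16_t sync_point;
	unsigned int failovers_started;
};

/* Bring the link fully down, however far the reset has progressed. */
static void force_link_down(struct link_model *l)
{
	l->state = LINK_RESET;
}

/* Failover may only start from a fully reset link. */
static void start_failover(struct link_model *l, uint16_t sync_point)
{
	assert(l->state == LINK_RESET);
	l->sync_point = sync_point;
	l->state = LINK_FAILINGOVER;
	l->failovers_started++;
}

/* Returns true if this FAILOVER_MSG started a failover. */
static bool on_failover_msg(struct link_model *l, uint16_t sync_point)
{
	switch (l->state) {
	case LINK_UP:
	case LINK_RESETTING:	/* early message: finish the reset first */
	case LINK_PEER_RESET:
		force_link_down(l);
		start_failover(l, sync_point);
		return true;
	case LINK_RESET:
		start_failover(l, sync_point);
		return true;
	case LINK_FAILINGOVER:	/* already running: keep sync_point */
		return false;
	}
	return false;
}
```

    With this shape, an early FAILOVER_MSG takes the same path as one
    received while the link is up, so 'sync_point' is set exactly once
    per failover.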
    
    Also, the function "tipc_node_link_failover()" is hardened to prevent
    such a situation from happening.
    Acked-by: Jon Maloy <jon.maloy@ericsson.se>
    Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
    Signed-off-by: David S. Miller <davem@davemloft.net>