    tipc: fix missing Name entries due to half-failover · c0b14a08
    Tuong Lien authored
    A TIPC link can temporarily fall into a "half-established" state, in
    which only one of the link endpoints is ESTABLISHED and starts to send
    traffic and PROTOCOL messages, whereas the other link endpoint is not
    up (e.g. the network interface goes down immediately after the
    endpoint receives ACTIVATE_MSG).
    
    This is a normal situation and resolves itself, because the link
    endpoint will eventually be brought down after the link tolerance time.
    
    However, the situation becomes worse when the second link is
    established before the first link endpoint goes down.
    For example:
    
       1. Both links <1A-2A>, <1B-2B> down
       2. Link endpoint 2A up, but 1A still down (e.g. due to network
          disturbance, wrong session, etc.)
       3. Link <1B-2B> up
       4. Link endpoint 2A down (e.g. due to link tolerance timeout)
       5. Node 2 starts failover onto link <1B-2B>
    
       ==> Node 1 never starts link failover.
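
The sequence above can be replayed with a small toy model. This is only an illustrative sketch, not kernel code: the `toy_*` names, the `struct toy_net` layout, and the rule "a node starts failover only when one of its own endpoints goes from up to down while its other endpoint is up" are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: NODE_1 owns endpoints 1A and 1B,
 * NODE_2 owns endpoints 2A and 2B. */
enum { NODE_1, NODE_2, NUM_NODES };
enum { LINK_A, LINK_B, NUM_LINKS };

struct toy_net {
    bool ep_up[NUM_NODES][NUM_LINKS];   /* endpoint state per node and link */
    bool failover_started[NUM_NODES];   /* did this node ever start failover? */
};

/* Bring one endpoint up. */
static void toy_ep_up(struct toy_net *net, int node, int link)
{
    assert(node >= 0 && node < NUM_NODES);
    assert(link >= 0 && link < NUM_LINKS);
    net->ep_up[node][link] = true;
}

/* Take one endpoint down. Assumed rule: failover starts only if this
 * endpoint was actually up and the node's other endpoint is still up. */
static void toy_ep_down(struct toy_net *net, int node, int link)
{
    int other = (link == LINK_A) ? LINK_B : LINK_A;

    assert(node >= 0 && node < NUM_NODES);
    assert(link >= 0 && link < NUM_LINKS);
    if (!net->ep_up[node][link])
        return;                         /* never up: nothing to fail over from */
    net->ep_up[node][link] = false;
    if (net->ep_up[node][other])
        net->failover_started[node] = true;
}

/* Replay steps 1 to 5 and return the resulting state. */
static struct toy_net toy_replay_half_failover(void)
{
    struct toy_net net = {0};           /* step 1: both links down */

    toy_ep_up(&net, NODE_2, LINK_A);    /* step 2: 2A up, 1A still down */
    toy_ep_up(&net, NODE_1, LINK_B);    /* step 3: link <1B-2B> up */
    toy_ep_up(&net, NODE_2, LINK_B);
    toy_ep_down(&net, NODE_2, LINK_A);  /* steps 4 and 5: 2A down, failover starts */
    toy_ep_down(&net, NODE_1, LINK_A);  /* 1A was never up: no failover here */
    return net;
}
```

In this model, `toy_replay_half_failover()` returns a state in which `failover_started` is set only for `NODE_2`, which mirrors the one-sided failover described above.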
    
    When the "half-failover" situation happens, two consequences have been
    observed:
    
    a) Peer link/node gets stuck in FAILINGOVER state;
    b) Traffic or user messages that the peer node is trying to fail over
    onto the second link can be partially or completely dropped by this
    node.
    
    Consequence a) was solved by commit c140eb16 ("tipc: fix failover
    problem"), but that commit did not cover b). This is because the
    tunnel link endpoint has never been prepared for a failover, so
    'l->drop_point' (and other data) is not set correctly. When a
    TUNNEL_MSG from the peer node arrives on the link, depending on the
    inner message's seqno and the current 'l->drop_point' value, the
    message can be dropped (treated as a duplicate message) or processed.
    At this early stage, the traffic messages from the peer are likely to
    be NAME_DISTRIBUTORs, which means some name table entries will be
    missing on this node forever!
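
The effect of an unprepared 'l->drop_point' can be illustrated with a minimal sketch. It is not the kernel implementation: `struct toy_link`, `toy_seq_less()` and `toy_tnl_drops()` are hypothetical names, and the sketch assumes 16-bit sequence numbers compared modulo 2^16, with an inner message treated as a duplicate when its seqno precedes the drop point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: "a precedes b" when b is ahead of a by less than half of
 * the 16-bit sequence space (wrap-around aware comparison). */
static bool toy_seq_less(uint16_t a, uint16_t b)
{
    return a != b && (uint16_t)(b - a) < 0x8000;
}

/* Hypothetical, reduced view of a tunnel link endpoint. */
struct toy_link {
    uint16_t drop_point;    /* inner seqnos before this count as duplicates */
};

/* Returns true if a tunnelled inner message would be dropped as a
 * duplicate, false if it would be processed. */
static bool toy_tnl_drops(const struct toy_link *l, uint16_t inner_seqno)
{
    assert(l != NULL);
    return toy_seq_less(inner_seqno, l->drop_point);
}
```

With a leftover drop point of, say, 100, an inner message with seqno 5 is classified as a duplicate and dropped, while the same message is processed when the drop point does not lie ahead of it. The values are illustrative only; the point is that the outcome depends entirely on whatever 'l->drop_point' happens to hold when the endpoint was never prepared for failover.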
    
    This commit resolves the issue by starting the FAILOVER process on
    this node as well. Another benefit of this solution is that it ensures
    the link will not be re-established until the failover ends.
    
    Acked-by: Jon Maloy <jon.maloy@ericsson.com>
    Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
    Signed-off-by: David S. Miller <davem@davemloft.net>