• Dennis Dalessandro's avatar
    IB/core, ipoib: Do not overreact to SM LID change event · ba7d8117
    Dennis Dalessandro authored
    When IPoIB receives an SM LID change event, it reacts by flushing its
    path record cache and rejoining multicast groups. This is the same
    behavior it performs when it receives a reregistration event. This
    behavior is unnecessary as an SM may have database backup or
    synchronization mechanisms which permit the SM location or LID to change
    without loss of multicast membership and without impact to path records.
    
    Both opensm and the OPA FM issue reregistration events if a new SM is
    started (or restarted with a new config) or an SM event occurs which
    results in loss of multicast membership records by the SM (such as
    opensm failover) or the SM encounters new nodes with Active ports (such
    as after joining 2 fabrics by connecting switches via ISLs). Hence this
    event can be depended on as the trigger for IPoIB cache and multicast
    flushing.
    
    It appears that some drivers, such as qib, and hfi1 issue the
    IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not.
    Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed
    that Mellanox EDR HCAs do not generate SM change events and that opensm
    does generate reregistration.
    
    An SM LID change event is generated by the mentioned drivers to reflect
    that sm_lid and/or sm_sl in the local port info has changed. The intent
    of this event is to permit applications and ULPs which have a local copy
    of this information (or an address handle using it) to update their
    information.
    
    The intent is that the reregistration event (caused by the SM via a bit
    in Set(PortInfo)) be used to inform nodes that they need to rejoin
    multicast groups, resubscribe for notices and potentially update path
    records.
    
    When an SM migrates or fails over, a SM LID change event can occur. In
    response IPoIB discards path records and multicast membership and loses
    connectivity until these records are restored via SA requests. In very
    large fabrics, it may take minutes for the SM to be ready and for the SA
    responses to be supplied.  This can result in undesirable and
    unnecessary IPoIB connectivity impacts. It also can result in an
    unnecessary storm of SA queries from all nodes in a cluster potentially
    followed by yet another storm if the SM issues the reregistration
    request.
    
    The fact the Mellanox HCAs do not even generate this event, is further
    evidence that on modern IB fabrics there will be no ill side effects
    from the proposed changes below to reduce the reaction by 3 kernel
    components to this event. So these changes should be benign for Mellanox
    IB fabrics and will benefit OPA fabrics while also making ib_core and
    ULP behavor "correct" as intended by the IBTA spec and kernel RDMA event
    APIs.
    
    Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib.
    IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do
    anything on SM LID change. IPoIB makes use of other ib_core components
    to issue SA requests for it and those components correctly track SM LID
    and SM LID changes.
    
    Also in ib_core multicast handling,  remove the test for
    IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the
    error state, which will trigger rejoins. This code is used by IPoIB as
    well as the connection manager and other clients of multicast groups.
    This kernel module centralizes group membership status and joins since a
    node can only join a given group once but multiple ULPs or applications
    may want to join the same group. It makes use of the sa_query.c
    component in ib_core, which correctly trackes SM LID and SL. This
    component does not track SM LID nor SL itself and hence need not react
    to their changes.
    
    Similarly in the ib_core cache code remove the handling for the
    IB_EVENT_SM_CHANGE.  In this function. The ib_cache_update function
    which is ultimately called is updating local copies of the pkey table,
    gid table and lmc. It does not update nor retain sm_lid nor sm_sl. As
    such it does not need to be called on an SM LID change. It technically
    also does not need to be called on a reregistration. The LID_CHANGE,
    PKEY_CHANGE, GID_CHANGE and port state change events (PORT_ERR,
    PORT_ACTICE) should be sufficient triggers.
    
    It is worth noting that the alternative of simply having the hfi1 and
    qib drivers not generate the SM LID change event was explored. While
    this would duplicate what Mellanox drivers do now, it is not the correct
    behavior and removes the ability for an SM to migrate without requiring
    reregistration. Since both opensm and OPA SM have mechanisms to backup
    or synchronize registration information, it is desirable to let them
    perform SM migrations (with LID or SL changes) without requiring
    reregistration when they deem it appropriate.
    Suggested-by: default avatarTodd Rimmer <todd.rimmer@intel.com>
    Tested-by: default avatarMichael Brooks <michael.brooks@intel.com>
    Reviewed-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
    Reviewed-by: default avatarTodd Rimmer <todd.rimmer@intel.com>
    Signed-off-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
    Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
    ba7d8117
cache.c 39.4 KB