    [IA64] Fix race when multiple cpus go through MCA · e1b1eb01
    Russ Anderson authored
    Additional testing uncovered a situation where the MCA recovery code could
    hang due to a race condition.
    
    According to the SAL spec, SAL sends a rendezvous interrupt to all but the first
    CPU that goes into MCA.  This includes other CPUs that go into MCA at the same
    time.  Those other CPUs will go into the Linux MCA handler (rather than the
    slave loop) with the rendezvous interrupt pending.  When all the CPUs have
    completed MCA processing and the last monarch completes, freeing all the CPUs,
    the CPUs with the pending rendezvous interrupt then go into
    ia64_mca_rendez_int_handler().  There the CPUs get marked as rendezvoused,
    but then leave the handler because no MCA is in progress.
    That leaves the CPUs marked as rendezvoused _before_ the next MCA event.
    
    When the next MCA hits, the monarch will mistakenly believe that all the CPUs
    are rendezvoused when they are not, opening up a window where a CPU can get
    stuck in the slave loop.
    
    This patch avoids leaving CPUs marked as rendezvoused when they are not.
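
The stale-flag problem described above can be modeled in a few lines of plain C. This is only a user-space sketch under stated assumptions, not the actual change to mca.c: the names rendez_checkin, rendez_int_handler(), all_rendezvoused() and stale_checkin_seen() are hypothetical, the wait loop inside the handler is omitted, and clearing the flag on handler exit is one plausible way to avoid the stale state, assumed here for illustration.

```c
#include <stdbool.h>

#define NCPUS 4

/* Hypothetical per-CPU rendezvous state; not the kernel's real structure. */
enum checkin { CHECKIN_NOTDONE = 0, CHECKIN_DONE = 1 };

static enum checkin rendez_checkin[NCPUS];

/*
 * Model of a rendezvous interrupt handler that runs when no MCA is in
 * progress. The CPU marks itself as rendezvoused and then leaves.
 * With clear_on_exit == false the flag stays set after the handler
 * returns (the racy behavior). With clear_on_exit == true the flag is
 * reset on the way out (the assumed fix).
 */
static void rendez_int_handler(int cpu, bool clear_on_exit)
{
    rendez_checkin[cpu] = CHECKIN_DONE;
    /* A real handler would wait here while an MCA is being processed. */
    if (clear_on_exit)
        rendez_checkin[cpu] = CHECKIN_NOTDONE;
}

/* What a monarch checks: have all other CPUs checked in? */
static bool all_rendezvoused(int monarch)
{
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (cpu != monarch && rendez_checkin[cpu] != CHECKIN_DONE)
            return false;
    }
    return true;
}

/*
 * Replays the sequence from the description: the pending rendezvous
 * interrupts are delivered after the first MCA has completed, then a
 * new MCA arrives with CPU 0 as monarch. Returns true if the monarch
 * wrongly sees every CPU as already rendezvoused.
 */
static bool stale_checkin_seen(bool clear_on_exit)
{
    for (int cpu = 0; cpu < NCPUS; cpu++)
        rendez_checkin[cpu] = CHECKIN_NOTDONE;

    /* Late rendezvous interrupts on the non-monarch CPUs. */
    for (int cpu = 1; cpu < NCPUS; cpu++)
        rendez_int_handler(cpu, clear_on_exit);

    /* Next MCA: no CPU has actually entered the slave loop yet. */
    return all_rendezvoused(0);
}
```

In this model, stale_checkin_seen(false) returns true, meaning the monarch is fooled by the leftover flags, while stale_checkin_seen(true) returns false.
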
    Signed-off-by: Russ Anderson <rja@sgi.com>
    Signed-off-by: Tony Luck <tony.luck@intel.com>
mca.c 59.1 KB