    powerpc/fadump: fix race between pstore write and fadump crash trigger · ba608c4f
    Sourabh Jain authored
    When we enter the fadump crash path via system reset, we fail to update
    the pstore.
    
    On the system reset path we first update the pstore and then proceed to
    the fadump crash. The problem is that when all the CPUs try to take the
    pstore lock to initiate the pstore write, only one CPU acquires the lock
    and proceeds with the write. Because they are in NMI context, the CPUs
    that fail to get the lock do not wait for their turn to write to the
    pstore; they simply move on to the next operation, which is the fadump
    crash. One of the CPUs that proceeded down the fadump crash path triggers
    the crash without waiting for the CPU that holds the pstore lock to
    complete the pstore update.
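
    The try-once locking behaviour described above can be sketched in
    userspace C. This is only an illustration, not the kernel's pstore code:
    pstore_lock, pstore_trylock(), pstore_unlock() and
    pstore_write_from_nmi() are hypothetical names, and a C11 atomic_flag
    stands in for the real spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the pstore lock; not the kernel's spinlock. */
static atomic_flag pstore_lock = ATOMIC_FLAG_INIT;

/* Try exactly once and never spin: an NMI-context caller must not block. */
static bool pstore_trylock(void)
{
    /* test_and_set returns the previous value; false means we took it. */
    return !atomic_flag_test_and_set(&pstore_lock);
}

static void pstore_unlock(void)
{
    atomic_flag_clear(&pstore_lock);
}

/* Returns true if this caller wrote the record, false if it skipped. */
static bool pstore_write_from_nmi(void)
{
    if (!pstore_trylock())
        return false;   /* lock is busy: skip the write and move on */

    /* ... the record would be written here ... */

    pstore_unlock();
    return true;
}
```

    A caller that gets false back continues straight to its next step, which
    is how a CPU can reach the crash trigger while the lock holder is still
    writing.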
    
    The timeline diagram below depicts the sequence of events that leads to
    an unsuccessful pstore update when we hit the fadump crash path via
    system reset.
    
                     1    2     3    ...      n   CPU Threads
                     |    |     |             |
                     |    |     |             |
     Reached the  -->|--->|---->| ----------->|
     system reset    |    |     |             |
     path            |    |     |             |
                     |    |     |             |
     Try to       -->|--->|---->|------------>|
     acquire the     |    |     |             |
     pstore lock     |    |     |             |
                     |    |     |             |
                     |    |     |             |
     Got the      -->| +->|     |             |<-+
     pstore lock     | |  |     |             |  |-->  Didn't get the
                     | --------------------------+     lock and moving
                     |    |     |             |        ahead on fadump
                     |    |     |             |        crash path
                     |    |     |             |
      Begins the  -->|    |     |             |
      process to     |    |     |             |<-- Got the chance to
      update the     |    |     |             |    trigger the crash
      pstore         | -> |     |    ... <-   |
                     | |  |     |         |   |
                     | |  |     |         |   |<-- Triggers the
                     | |  |     |         |   |    crash
                     | |  |     |         |   |      ^
                     | |  |     |         |   |      |
      Writing to  -->| |  |     |         |   |      |
      pstore         | |  |     |         |   |      |
                       |                  |          |
           ^           |__________________|          |
           |               CPU Relax                 |
           |                                         |
           +-----------------------------------------+
                              |
                              v
                Race: crash triggered before pstore
                      update completes
    
    To avoid this race condition, a barrier is added on the crash_fadump
    path. It prevents a CPU from triggering the crash until all the online
    CPUs have completed their tasks.
    
    The barrier makes sure all the secondary CPUs hit the crash_fadump
    function before we initiate the crash. A timeout ensures that the primary
    CPU (the one that initiates the crash) does not wait for the secondary
    CPUs indefinitely.
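
    A minimal userspace sketch of such a counting barrier with a timeout is
    given here. It is not the code from this patch: cpus_in_crash_path,
    crash_path_enter(), wait_for_secondaries(), delay_ms() and the 500 ms
    value of CRASH_WAIT_TIMEOUT_MS are all assumptions made for illustration,
    and the busy-wait on clock() only approximates a kernel millisecond
    delay.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical timeout; the commit message does not state the value. */
#define CRASH_WAIT_TIMEOUT_MS 500

/* Number of CPUs that have entered the crash path (hypothetical name). */
static atomic_int cpus_in_crash_path;

/* Busy-wait for roughly `ms` milliseconds of CPU time. */
static void delay_ms(int ms)
{
    clock_t start = clock();

    if (start == (clock_t)-1)
        return;         /* no usable clock: do not risk spinning forever */

    clock_t end = start + (clock_t)ms * CLOCKS_PER_SEC / 1000;
    while (clock() < end)
        ;
}

/* Every CPU announces itself on entry to the crash path. */
static void crash_path_enter(void)
{
    atomic_fetch_add(&cpus_in_crash_path, 1);
}

/*
 * The primary CPU waits until `expected` secondaries have arrived or the
 * timeout runs out. Returns true if all of them arrived in time.
 */
static bool wait_for_secondaries(int expected, int timeout_ms)
{
    while (atomic_load(&cpus_in_crash_path) < expected && timeout_ms > 0) {
        delay_ms(1);
        timeout_ms--;
    }
    return atomic_load(&cpus_in_crash_path) >= expected;
}
```

    In this sketch the primary would call
    wait_for_secondaries(ncpus - 1, CRASH_WAIT_TIMEOUT_MS) and trigger the
    crash afterwards whether or not the call returned true, so a stuck
    secondary delays the crash by at most the timeout.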

    Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    Link: https://lore.kernel.org/r/20200713052435.183750-1-sourabhjain@linux.ibm.com
fadump.c 42.5 KB