    powerpc/mce: Fix a bug where mce loops on memory UE. · 75ecfb49
    Mahesh Salgaonkar authored
    The current code extracts the physical address for UE errors and then
    hooks it up to the memory failure infrastructure. On successful
    extraction of the physical address it wrongly sets "handled = 1", which
    means the UE error has been recovered. Since the MCE handler gets
    handled = 1 as the return value, it assumes that the error has been
    recovered and returns to the same NIP. This triggers the MCE interrupt
    again and again in a loop, leading to a hard lockup.
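
    The loop described above can be modelled in user space. This is only a
    sketch, not the code in mce_power.c: handle_ue(), run_faulting_task()
    and the max_mce bound (a stand-in for the hard lockup watchdog) are
    hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the UE handler's return value (not kernel code).
 * Before the patch, finding the physical address was enough to report the
 * error as handled. After the patch, "handled" stays 0 for a UE.
 */
static int handle_ue(bool phys_addr_found, bool patched)
{
    int handled = 0;

    if (phys_addr_found && !patched)
        handled = 1;    /* the bug: address lookup is not recovery */

    return handled;
}

/*
 * Models a task whose instruction at a fixed NIP hits a UE every time it
 * runs. Returns the number of machine checks taken. A result equal to
 * max_mce means the loop never ended on its own, which stands in for the
 * hard lockup.
 */
static int run_faulting_task(bool patched, int max_mce)
{
    int mce_count = 0;

    assert(max_mce > 0);

    while (mce_count < max_mce) {
        mce_count++;    /* the access at NIP raises an MCE */

        if (!handle_ue(true, patched))
            break;      /* not recovered: the task is not resumed */

        /* reported as recovered: resume at the same NIP and fault again */
    }

    return mce_count;
}
```

    In this model the unpatched handler only stops when the max_mce bound
    is reached, while the patched handler takes a single machine check.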
    
    Also, initialize phys_addr to ULONG_MAX so that we don't end up
    queuing an undesired page for hwpoison.
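
    A minimal sketch of the sentinel idea, assuming a 64-bit unsigned long
    (so UINT64_MAX stands in for ULONG_MAX) and 64K pages, which matches
    the logs in this message (0x20181a080000 >> 16 = 0x20181a08). The
    struct and function names are hypothetical, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ULONG_MAX on a 64-bit kernel: "no physical address found". */
#define PHYS_ADDR_INVALID UINT64_MAX

/* Assumed 64K page size, inferred from the logged address and PFN. */
#define SKETCH_PAGE_SHIFT 16

/* Hypothetical, reduced error record holding only the physical address. */
struct ue_report {
    uint64_t phys_addr;
};

/* Start from the sentinel so a failed lookup can never name a real page. */
static void ue_report_init(struct ue_report *r)
{
    assert(r != NULL);
    r->phys_addr = PHYS_ADDR_INVALID;
}

/*
 * Returns true and stores the page frame number only when a real physical
 * address was recorded. Returns false when the sentinel is still in place,
 * so nothing would be queued for hwpoison.
 */
static bool ue_pfn_to_poison(const struct ue_report *r, uint64_t *pfn)
{
    assert(r != NULL && pfn != NULL);

    if (r->phys_addr == PHYS_ADDR_INVALID)
        return false;

    *pfn = r->phys_addr >> SKETCH_PAGE_SHIFT;
    return true;
}
```

    With the sentinel in place, a record whose lookup failed yields no PFN
    at all, where an uninitialized or zero value would have pointed at
    some unrelated page.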
    
    Without this patch we see:
      Severe Machine check interrupt [Recovered]
        NIP: [000000001002588c] PID: 7109 Comm: find
        Initiator: CPU
        Error type: UE [Load/Store]
          Effective address: 00007fffd2755940
          Physical address:  000020181a080000
      ...
      Severe Machine check interrupt [Recovered]
        NIP: [000000001002588c] PID: 7109 Comm: find
        Initiator: CPU
        Error type: UE [Load/Store]
          Effective address: 00007fffd2755940
          Physical address:  000020181a080000
      Severe Machine check interrupt [Recovered]
        NIP: [000000001002588c] PID: 7109 Comm: find
        Initiator: CPU
        Error type: UE [Load/Store]
          Effective address: 00007fffd2755940
          Physical address:  000020181a080000
      Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
      Memory failure: 0x20181a08: already hardware poisoned
      Memory failure: 0x20181a08: already hardware poisoned
      Memory failure: 0x20181a08: already hardware poisoned
      Memory failure: 0x20181a08: already hardware poisoned
      Memory failure: 0x20181a08: already hardware poisoned
      Memory failure: 0x20181a08: already hardware poisoned
      ...
      Watchdog CPU:38 Hard LOCKUP
    
    After this patch we see:
    
      Severe Machine check interrupt [Not recovered]
        NIP: [00007fffaae585f4] PID: 7168 Comm: find
        Initiator: CPU
        Error type: UE [Load/Store]
          Effective address: 00007fffaafe28ac
          Physical address:  00002017c0bd0000
      find[7168]: unhandled signal 7 at 00007fffaae585f4 nip 00007fffaae585f4 lr 00007fffaae585e0 code 4
      Memory failure: 0x2017c0bd: recovery action for dirty LRU page: Recovered
    
    Fixes: 01eaac2b ("powerpc/mce: Hookup ierror (instruction) UE errors")
    Fixes: ba41e1e1 ("powerpc/mce: Hookup derror (load/store) UE errors")
    Cc: stable@vger.kernel.org # v4.15+
    Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
    Signed-off-by: Balbir Singh <bsingharora@gmail.com>
    Reviewed-by: Balbir Singh <bsingharora@gmail.com>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
mce_power.c 18.9 KB