1. 02 May, 2019 34 commits
  2. 01 May, 2019 6 commits
    • powerpc/tm: Avoid machine crash on rt_sigreturn() · e620d450
      Breno Leitao authored
      There is a kernel crash that happens if rt_sigreturn() is called inside
      a transactional block.
      
      This crash happens if the kernel hits an in-kernel page fault when
      accessing userspace memory, usually through copy_ckvsx_to_user(). A
      major page fault calls might_sleep(), which can cause a task
      reschedule. A task reschedule (switch_to()) reclaims and recheckpoints
      the TM state, but in the signal return path the checkpointed memory
      was already reclaimed, so the MSR saved on the exception stack has
      MSR[TS]=0.
      
      If a task reschedule happened by the time the code returns from
      might_sleep(), the task resumes with the memory recheckpointed and
      CPU MSR[TS] = suspended.
      
      This means that there is a side effect at might_sleep() if it is
      called with CPU MSR[TS] = 0 and the task has regs->msr[TS] != 0.
      
      This side effect can cause a TM Bad Thing: at exception entry the
      stack saves MSR[TS]=0, which is what RFID will use, but the processor
      has MSR[TS] = suspended. This transition is invalid, so a TM Bad Thing
      exception is raised, causing the following crash:
      
        Unexpected TM Bad Thing exception at c00000000000e9ec (msr 0x8000000302a03031) tm_scratch=800000010280b033
        cpu 0xc: Vector: 700 (Program Check) at [c00000003ff1fd70]
            pc: c00000000000e9ec: fast_exception_return+0x100/0x1bc
            lr: c000000000032948: handle_rt_signal64+0xb8/0xaf0
            sp: c0000004263ebc40
           msr: 8000000302a03031
          current = 0xc000000415050300
          paca    = 0xc00000003ffc4080	 irqmask: 0x03	 irq_happened: 0x01
            pid   = 25006, comm = sigfuz
        Linux version 5.0.0-rc1-00001-g3bd6e94b (breno@debian) (gcc version 8.2.0 (Debian 8.2.0-3)) #899 SMP Mon Jan 7 11:30:07 EST 2019
        WARNING: exception is not recoverable, can't continue
        enter ? for help
        [c0000004263ebc40] c000000000032948 handle_rt_signal64+0xb8/0xaf0 (unreliable)
        [c0000004263ebd30] c000000000022780 do_notify_resume+0x2f0/0x430
        [c0000004263ebe20] c00000000000e844 ret_from_except_lite+0x70/0x74
        --- Exception: c00 (System Call) at 00007fffbaac400c
        SP (7fffeca90f40) is in userspace
      
      The solution is to run the sigreturn code with regs->msr[TS]
      disabled, which avoids the side effect above. This does not seem to
      be a problem, since regs->msr will be replaced by the ucontext value
      anyway, so it is already being flushed. With this change it is
      simply flushed earlier.

      Signed-off-by: Breno Leitao <leitao@debian.org>
      Acked-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
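The sequence described in the commit message above can be sketched as a small user-space C model. It is an illustration only, not kernel code: the `toy_*` names, the structs, and the enum are hypothetical, and only the MSR[TS] states and the order of events are taken from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encoding of the MSR[TS] field, for illustration only. */
enum toy_ts { TOY_TS_NONE = 0, TOY_TS_SUSPENDED = 1, TOY_TS_TRANSACTIONAL = 2 };

struct toy_task {
    enum toy_ts regs_msr_ts;   /* TS field of the task's saved user MSR (regs->msr) */
};

struct toy_cpu {
    enum toy_ts msr_ts;        /* TS field of the live CPU MSR */
};

/*
 * Models the side effect from the commit message: if the task is
 * rescheduled while regs->msr[TS] != 0, the switch-in path recheckpoints
 * and the CPU comes back with MSR[TS] = suspended.
 */
void toy_might_sleep(struct toy_cpu *cpu, const struct toy_task *task, bool rescheduled)
{
    assert(cpu != NULL && task != NULL);
    if (rescheduled && task->regs_msr_ts != TOY_TS_NONE)
        cpu->msr_ts = TOY_TS_SUSPENDED;
}

/* In this model, returning with saved TS=0 while the CPU is suspended is the invalid transition. */
bool toy_rfid_is_bad_thing(const struct toy_cpu *cpu, enum toy_ts saved_ts)
{
    return cpu->msr_ts == TOY_TS_SUSPENDED && saved_ts == TOY_TS_NONE;
}

/* Runs one toy sigreturn; clear_ts_first selects the behaviour the fix describes. */
bool toy_sigreturn_crashes(enum toy_ts user_ts, bool rescheduled, bool clear_ts_first)
{
    struct toy_task task = { .regs_msr_ts = user_ts };
    struct toy_cpu cpu = { .msr_ts = TOY_TS_NONE };  /* checkpoint already reclaimed */
    enum toy_ts saved_ts = TOY_TS_NONE;              /* MSR[TS] saved at exception entry */

    if (clear_ts_first)
        task.regs_msr_ts = TOY_TS_NONE;              /* the fix: drop regs->msr[TS] early */

    toy_might_sleep(&cpu, &task, rescheduled);       /* e.g. a major fault in a user copy */
    return toy_rfid_is_bad_thing(&cpu, saved_ts);
}
```

Calling `toy_sigreturn_crashes(TOY_TS_TRANSACTIONAL, true, false)` models the failing path, while passing `true` as the last argument models the fixed path, where a reschedule no longer leaves the CPU suspended.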
    • powerpc/mm/radix: Fix kernel crash when running subpage protect test · 2c474c03
      Aneesh Kumar K.V authored
      This patch fixes the crash below by making sure we touch the subpage
      protection related structures only if we know they are allocated on
      the platform. With radix translation we don't allocate a hash context
      at all, and trying to access subpage_prot_table results in:
      
        Faulting instruction address: 0xc00000000008bdb4
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
        ....
        NIP [c00000000008bdb4] sys_subpage_prot+0x74/0x590
        LR [c00000000000b688] system_call+0x5c/0x70
        Call Trace:
        [c00020002c6b7d30] [c00020002c6b7d90] 0xc00020002c6b7d90 (unreliable)
        [c00020002c6b7e20] [c00000000000b688] system_call+0x5c/0x70
        Instruction dump:
        fb61ffd8 fb81ffe0 fba1ffe8 fbc1fff0 fbe1fff8 f821ff11 e92d1178 f9210068
        39200000 e92d0968 ebe90630 e93f03e8 <eb891038> 60000000 3860fffe e9410068
      
      We also move the subpage_prot_table handling under mmap_sem to avoid
      a race between two parallel subpage_prot syscalls.
      
      Fixes: 70110186 ("powerpc/mm: Reduce memory usage for mm_context_t for radix")
      Reported-by: Sachin Sant <sachinp@linux.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Tested-by: Sachin Sant <sachinp@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
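The guard described in the commit message above (touch the subpage protection structures only when they were actually allocated) might look roughly like the following user-space sketch. All `toy_*` types, the `maxaddr` field, and the specific error codes are assumptions made for illustration; the message does not state how the kernel patch reports the condition.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the structures named in the commit message. */
struct toy_subpage_prot_table {
    unsigned long maxaddr;                 /* illustrative field only */
};

struct toy_hash_context {
    struct toy_subpage_prot_table *spt;    /* NULL until allocated */
};

struct toy_mm_context {
    struct toy_hash_context *hash_context; /* NULL when radix translation is used */
};

/*
 * Returns 0 on success or a negative errno. The NULL checks come first,
 * so a context without a hash context never dereferences the table.
 */
long toy_subpage_prot(const struct toy_mm_context *ctx, unsigned long addr)
{
    assert(ctx != NULL);

    if (ctx->hash_context == NULL || ctx->hash_context->spt == NULL)
        return -ENOENT;                    /* assumed error code */

    if (addr >= ctx->hash_context->spt->maxaddr)
        return -EINVAL;                    /* assumed error code */

    return 0;
}
```

The point of the sketch is only the ordering: the allocation check happens before any field of the table is read, which is the access that faulted in the oops above.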
    • powerpc/powernv/mce: Print additional information about MCE error. · 50dbabe0
      Mahesh Salgaonkar authored
      Print more information about an MCE error, indicating whether it is
      a hardware or software error.
      
      Some MCE errors can easily be categorized as hardware or software
      errors, e.g. UEs are due to hardware errors, whereas an error
      triggered by invalid usage of tlbie is a pure software bug. But not
      all MCE errors can be easily categorized as either software or
      hardware. Multihit errors are usually the result of a software bug,
      but in some rare cases a hardware failure can cause a multihit error.
      In the past, we have seen cases where multihit errors stopped
      occurring after a faulty chip was replaced. The same goes for parity
      errors, which are usually due to faulty hardware, but there are cases
      where a multihit can also cause a parity error. For such errors it is
      difficult to determine the real cause. Hence this patch classifies
      MCE errors into the following four categories:
      
        1. Hardware error:
        	UE and Link timeout failure errors.
        2. Probable hardware error (some chance of software cause)
        	SLB/ERAT/TLB Parity errors.
        3. Software error
        	Invalid tlbie form.
        4. Probable software error (some chance of hardware cause)
        	SLB/ERAT/TLB Multihit errors.
      
      Sample output:
      
        MCE: CPU80: machine check (Warning) Guest SLB Multihit DAR: 000001001b6e0320 [Recovered]
        MCE: CPU80: PID: 24765 Comm: qemu-system-ppc Guest NIP: [00007fffa309dc60]
        MCE: CPU80: Probable Software error (some chance of hardware cause)

      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
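The four categories listed in the commit message above can be expressed as a simple lookup. This is a minimal sketch, assuming hypothetical `toy_*` enum and function names; the mapping and the category strings follow the list in the message, not the kernel source.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical error kinds, one per item named in the commit message. */
enum toy_mce_error {
    TOY_MCE_UE,
    TOY_MCE_LINK_TIMEOUT,
    TOY_MCE_SLB_PARITY,
    TOY_MCE_ERAT_PARITY,
    TOY_MCE_TLB_PARITY,
    TOY_MCE_INVALID_TLBIE,
    TOY_MCE_SLB_MULTIHIT,
    TOY_MCE_ERAT_MULTIHIT,
    TOY_MCE_TLB_MULTIHIT
};

enum toy_mce_class {
    TOY_MCE_CLASS_HARDWARE,
    TOY_MCE_CLASS_PROBABLE_HARDWARE,
    TOY_MCE_CLASS_SOFTWARE,
    TOY_MCE_CLASS_PROBABLE_SOFTWARE
};

/* Maps an error kind to one of the four categories from the message. */
enum toy_mce_class toy_mce_classify(enum toy_mce_error err)
{
    switch (err) {
    case TOY_MCE_UE:
    case TOY_MCE_LINK_TIMEOUT:
        return TOY_MCE_CLASS_HARDWARE;
    case TOY_MCE_SLB_PARITY:
    case TOY_MCE_ERAT_PARITY:
    case TOY_MCE_TLB_PARITY:
        return TOY_MCE_CLASS_PROBABLE_HARDWARE;
    case TOY_MCE_INVALID_TLBIE:
        return TOY_MCE_CLASS_SOFTWARE;
    case TOY_MCE_SLB_MULTIHIT:
    case TOY_MCE_ERAT_MULTIHIT:
    case TOY_MCE_TLB_MULTIHIT:
        return TOY_MCE_CLASS_PROBABLE_SOFTWARE;
    }
    /* Out-of-range values: fall back to the least specific software class. */
    return TOY_MCE_CLASS_PROBABLE_SOFTWARE;
}

/* Human-readable text for each category, worded as in the list above. */
const char *toy_mce_class_str(enum toy_mce_class cls)
{
    switch (cls) {
    case TOY_MCE_CLASS_HARDWARE:
        return "Hardware error";
    case TOY_MCE_CLASS_PROBABLE_HARDWARE:
        return "Probable hardware error (some chance of software cause)";
    case TOY_MCE_CLASS_SOFTWARE:
        return "Software error";
    case TOY_MCE_CLASS_PROBABLE_SOFTWARE:
        return "Probable software error (some chance of hardware cause)";
    }
    return "Unknown";
}
```

For example, `toy_mce_class_str(toy_mce_classify(TOY_MCE_SLB_MULTIHIT))` yields the "Probable software error" text that appears as the third line of the sample output.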
    • powerpc/powernv/mce: Print correct severity for MCE error. · cda6618d
      Mahesh Salgaonkar authored
      Currently all machine check errors are printed as severe errors,
      which isn't correct. Print soft errors as warnings instead of severe
      errors.

      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/mce: Reduce MCE console logs to lesser lines. · d6e8a150
      Mahesh Salgaonkar authored
      Also add the CPU number when displaying the MCE log. This keeps the
      logs readable when MCEs hit multiple CPUs simultaneously.
      
      Before the changes the MCE output was:
      
        Severe Machine check interrupt [Recovered]
          NIP [d00000000ba80280]: insert_slb_entry.constprop.0+0x278/0x2c0 [mcetest_slb]
          Initiator: CPU
          Error type: SLB [Multihit]
            Effective address: d00000000ba80280
      
      After this patch series, the MCE output will be:
      
        MCE: CPU80: machine check (Warning) Host SLB Multihit [Recovered]
        MCE: CPU80: NIP: [d00000000b550280] insert_slb_entry.constprop.0+0x278/0x2c0 [mcetest_slb]
        MCE: CPU80: Probable software error (some chance of hardware cause)
      
      UE in host application:
      
        MCE: CPU48: machine check (Severe) Host UE Load/Store DAR: 00007fffc6079a80 paddr: 0000000f8e260000 [Not recovered]
        MCE: CPU48: PID: 4584 Comm: find NIP: [0000000010023368]
        MCE: CPU48: Hardware error
      
      and for MCE in Guest:
      
        MCE: CPU80: machine check (Warning) Guest SLB Multihit DAR: 000001001b6e0320 [Recovered]
        MCE: CPU80: PID: 24765 Comm: qemu-system-ppc Guest NIP: [00007fffa309dc60]
        MCE: CPU80: Probable software error (some chance of hardware cause)

      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
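The first line of the new log format shown above can be reproduced with a short formatting helper. This is a sketch under stated assumptions: `toy_mce_format_header` and its parameters are hypothetical, and only the layout of the line is taken from the sample output (the optional DAR and paddr fields are left out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Builds "MCE: CPU<n>: machine check (<severity>) <Host|Guest> <error> [<state>]"
 * into buf and returns the snprintf() result (the length that was needed).
 */
int toy_mce_format_header(char *buf, size_t len, int cpu, bool severe,
                          bool in_guest, const char *error, bool recovered)
{
    assert(buf != NULL && error != NULL);
    return snprintf(buf, len, "MCE: CPU%d: machine check (%s) %s %s [%s]",
                    cpu,
                    severe ? "Severe" : "Warning",
                    in_guest ? "Guest" : "Host",
                    error,
                    recovered ? "Recovered" : "Not recovered");
}
```

Prefixing every line with the CPU number is what lets interleaved reports from several CPUs be told apart, which is the motivation given in the commit message.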
    • powerpc: Add doorbell tracepoints · 5b2a1529
      Anton Blanchard authored
      When analysing sources of OS jitter, I noticed that doorbells cannot be
      traced.

      Signed-off-by: Anton Blanchard <anton@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>