1. 02 Dec, 2021 5 commits
  2. 01 Dec, 2021 2 commits
  3. 30 Nov, 2021 8 commits
    • powerpc/32s: Fix shift-out-of-bounds in KASAN init · af11dee4
      Christophe Leroy authored
      ================================================================================
      UBSAN: shift-out-of-bounds in arch/powerpc/mm/kasan/book3s_32.c:22:23
      shift exponent -1 is negative
      CPU: 0 PID: 0 Comm: swapper Not tainted 5.15.5-gentoo-PowerMacG4 #9
      Call Trace:
      [c214be60] [c0ba0048] dump_stack_lvl+0x80/0xb0 (unreliable)
      [c214be80] [c0b99288] ubsan_epilogue+0x10/0x5c
      [c214be90] [c0b98fe0] __ubsan_handle_shift_out_of_bounds+0x94/0x138
      [c214bf00] [c1c0f010] kasan_init_region+0xd8/0x26c
      [c214bf30] [c1c0ed84] kasan_init+0xc0/0x198
      [c214bf70] [c1c08024] setup_arch+0x18/0x54c
      [c214bfc0] [c1c037f0] start_kernel+0x90/0x33c
      [c214bff0] [00003610] 0x3610
      ================================================================================
      
      This happens when the size of the directly mapped memory is a power of 2.
      
      Fix it by checking the shift and setting the result to 0 when the shift
      is -1.
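
A minimal userspace sketch of this kind of guard, not the kernel's code. It assumes the size is split into a largest power-of-two block plus a second block sized from the remainder: when the size is itself a power of two the remainder is 0, it has no highest set bit, and the shift would be -1. The names `highest_bit`, `pow2_floor_or_zero` and `split_blocks` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the highest set bit, or -1 when x is 0 (hypothetical helper). */
int highest_bit(uint32_t x)
{
	int bit = -1;

	while (x) {
		x >>= 1;
		bit++;
	}
	return bit;
}

/* Largest power of two <= x; yields 0 instead of shifting by -1. */
uint32_t pow2_floor_or_zero(uint32_t x)
{
	int shift = highest_bit(x);

	if (shift < 0)
		return 0;
	return (uint32_t)1 << shift;
}

/* Split size into a power-of-two base block and a second, smaller block. */
void split_blocks(uint32_t size, uint32_t *base, uint32_t *more)
{
	assert(base && more);
	*base = pow2_floor_or_zero(size);
	*more = pow2_floor_or_zero(size - *base);
}
```

With a power-of-two size such as 0x10000000 the second block simply comes out as 0.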
      
      Fixes: 7974c473 ("powerpc/32s: Implement dedicated kasan_init_region()")
      Reported-by: Erhard Furtner <erhard_f@mailbox.org>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=215169
      Link: https://lore.kernel.org/r/15cbc3439d4ad988b225e2119ec99502a5cc6ad3.1638261744.git.christophe.leroy@csgroup.eu
    • powerpc/powermac: Add missing lockdep_register_key() · df1f679d
      Christophe Leroy authored
      KeyWest i2c @0xf8001003 irq 42 /uni-n@f8000000/i2c@f8001000
      BUG: key c2d00cbc has not been registered!
      ------------[ cut here ]------------
      DEBUG_LOCKS_WARN_ON(1)
      WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:4801 lockdep_init_map_type+0x4c0/0xb4c
      Modules linked in:
      CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.5-gentoo-PowerMacG4 #9
      NIP:  c01a9428 LR: c01a9428 CTR: 00000000
      REGS: e1033cf0 TRAP: 0700   Not tainted  (5.15.5-gentoo-PowerMacG4)
      MSR:  00029032 <EE,ME,IR,DR,RI>  CR: 24002002  XER: 00000000
      
      GPR00: c01a9428 e1033db0 c2d1cf20 00000016 00000004 00000001 c01c0630 e1033a73
      GPR08: 00000000 00000000 00000000 e1033db0 24002004 00000000 f8729377 00000003
      GPR16: c1829a9c 00000000 18305357 c1416fc0 c1416f80 c006ac60 c2d00ca8 c1416f00
      GPR24: 00000000 c21586f0 c2160000 00000000 c2d00cbc c2170000 c216e1a0 c2160000
      NIP [c01a9428] lockdep_init_map_type+0x4c0/0xb4c
      LR [c01a9428] lockdep_init_map_type+0x4c0/0xb4c
      Call Trace:
      [e1033db0] [c01a9428] lockdep_init_map_type+0x4c0/0xb4c (unreliable)
      [e1033df0] [c1c177b8] kw_i2c_add+0x334/0x424
      [e1033e20] [c1c18294] pmac_i2c_init+0x9ec/0xa9c
      [e1033e80] [c1c1a790] smp_core99_probe+0xbc/0x35c
      [e1033eb0] [c1c03cb0] kernel_init_freeable+0x190/0x5a4
      [e1033f10] [c000946c] kernel_init+0x28/0x154
      [e1033f30] [c0035148] ret_from_kernel_thread+0x14/0x1c
      
      Add missing lockdep_register_key()
      Reported-by: Erhard Furtner <erhard_f@mailbox.org>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/69e4f55565bb45ebb0843977801b245af0c666fe.1638264741.git.christophe.leroy@csgroup.eu
    • powerpc/modules: Don't WARN on first module allocation attempt · f1797e4d
      Christophe Leroy authored
      module_alloc() first tries to allocate module text within 24-bit direct
      jump range of kernel text, and tries a wider allocation if the first one
      fails.
      
      When the first allocation fails, the following is observed in kernel logs:
      
        vmap allocation for size 2400256 failed: use vmalloc=<size> to increase size
        systemd-udevd: vmalloc error: size 2395133b, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null)
        CPU: 0 PID: 127 Comm: systemd-udevd Tainted: G        W         5.15.5-gentoo-PowerMacG4 #9
        Call Trace:
        [e2a53a50] [c0ba0048] dump_stack_lvl+0x80/0xb0 (unreliable)
        [e2a53a70] [c0540128] warn_alloc+0x11c/0x2b4
        [e2a53b50] [c0531be8] __vmalloc_node_range+0xd8/0x64c
        [e2a53c10] [c00338c0] module_alloc+0xa0/0xac
        [e2a53c40] [c027a368] load_module+0x2ae0/0x8148
        [e2a53e30] [c027fc78] sys_finit_module+0xfc/0x130
        [e2a53f30] [c0035098] ret_from_syscall+0x0/0x28
        ...
      
      Add the __GFP_NOWARN flag to the first allocation so that no warning
      appears when it fails.
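
The pattern can be sketched in userspace C as below. This is an illustration under stated assumptions, not module_alloc() itself: `struct range`, `range_alloc()` and `alloc_near_then_wide()` are hypothetical, and the `nowarn` argument merely plays the role that __GFP_NOWARN plays for the real allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical address range that only tracks how many bytes remain. */
struct range {
	size_t free_bytes;
};

/* Number of failure warnings printed so far. */
int warnings_printed;

/* Try to take size bytes from r; report a failure unless nowarn is set. */
bool range_alloc(struct range *r, size_t size, bool nowarn)
{
	assert(r);
	if (size <= r->free_bytes) {
		r->free_bytes -= size;
		return true;
	}
	if (!nowarn) {
		warnings_printed++;
		fprintf(stderr, "allocation of %zu bytes failed\n", size);
	}
	return false;
}

/* Returns 1 for the near range, 2 for the wide range, 0 on failure. */
int alloc_near_then_wide(struct range *near_text, struct range *wide,
			 size_t size)
{
	/* The first attempt is expected to fail sometimes, so stay quiet. */
	if (range_alloc(near_text, size, true))
		return 1;
	/* The fallback is the last resort, so a failure here is reported. */
	if (range_alloc(wide, size, false))
		return 2;
	return 0;
}
```

Only a failure of the second, wider attempt produces a message, because only that failure is final.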
      Reported-by: Erhard Furtner <erhard_f@mailbox.org>
      Fixes: 2ec13df1 ("powerpc/modules: Load modules closer to kernel text")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/93c9b84d6ec76aaf7b4f03468e22433a6d308674.1638267035.git.christophe.leroy@csgroup.eu
    • powerpc/64s: Get LPID bit width from device tree · 5402e239
      Nicholas Piggin authored
      Allow the LPID bit width and partition table size to be set at runtime
      from the device tree.
      
      Move the PID bit width detection into the same place.
      
      KVM does not support using the extra bits yet; this is mainly required
      to get the PTCR register values correct (so KVM will run, but it will
      not allocate more than 4096 LPIDs).
      
      OPAL firmware provides this property for POWER10 CPUs since skiboot
      commit 9b85f7d961f2 ("hdata: add mmu-pid-bits and mmu-lpid-bits for
      POWER10 CPUs").
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211129030915.1888332-1-npiggin@gmail.com
    • powerpc/perf: Fix PMU callbacks to clear pending PMI before resetting an overflown PMC · 2c9ac51b
      Athira Rajeev authored
      Running the perf fuzzer showed the following in the dmesg logs:
        "Can't find PMC that caused IRQ"
      
      This means a PMU exception happened, but none of the PMCs (Performance
      Monitor Counters) were found to be overflown. There are some corner
      cases that clear the PMCs after the PMI gets masked. In such cases, the
      perf interrupt handler will not find the active PMC values that caused
      the overflow, which leads to this message while replaying.
      
      Case 1: A PMU interrupt happens during replay of other interrupts and
      the counter values get cleared by PMU callbacks before replay:
      
      During replay of interrupts like timer, __do_irq() and doorbell
      exception, we conditionally enable interrupts via may_hard_irq_enable().
      This could potentially create a window to generate a PMI. Since the irq
      soft mask is set to ALL_DISABLED, the PMI will get masked here. IPIs
      could run before the perf interrupt is replayed, and the PMU events
      could be deleted or stopped. This will change the PMU SPR values and
      reset the counters. Snippet of ftrace log showing PMU callbacks invoked
      in __do_irq():
      
        <idle>-0 [051] dns. 132025441306354: __do_irq <-call_do_irq
        <idle>-0 [051] dns. 132025441306430: irq_enter <-__do_irq
        <idle>-0 [051] dns. 132025441306503: irq_enter_rcu <-__do_irq
        <idle>-0 [051] dnH. 132025441306599: xive_get_irq <-__do_irq
        <<>>
        <idle>-0 [051] dnH. 132025441307770: generic_smp_call_function_single_interrupt <-smp_ipi_demux_relaxed
        <idle>-0 [051] dnH. 132025441307839: flush_smp_call_function_queue <-smp_ipi_demux_relaxed
        <idle>-0 [051] dnH. 132025441308057: _raw_spin_lock <-event_function
        <idle>-0 [051] dnH. 132025441308206: power_pmu_disable <-perf_pmu_disable
        <idle>-0 [051] dnH. 132025441308337: power_pmu_del <-event_sched_out
        <idle>-0 [051] dnH. 132025441308407: power_pmu_read <-power_pmu_del
        <idle>-0 [051] dnH. 132025441308477: read_pmc <-power_pmu_read
        <idle>-0 [051] dnH. 132025441308590: isa207_disable_pmc <-power_pmu_del
        <idle>-0 [051] dnH. 132025441308663: write_pmc <-power_pmu_del
        <idle>-0 [051] dnH. 132025441308787: power_pmu_event_idx <-perf_event_update_userpage
        <idle>-0 [051] dnH. 132025441308859: rcu_read_unlock_strict <-perf_event_update_userpage
        <idle>-0 [051] dnH. 132025441308975: power_pmu_enable <-perf_pmu_enable
        <<>>
        <idle>-0 [051] dnH. 132025441311108: irq_exit <-__do_irq
        <idle>-0 [051] dns. 132025441311319: performance_monitor_exception <-replay_soft_interrupts
      
      Case 2: PMIs masked during local_* operations, for example local_add().
      If the local_add() operation happens within a local_irq_save(), replay
      of the PMI will happen during local_irq_restore(). Similar to case 1,
      this could also create a window before replay where PMU events get
      deleted or stopped.
      
      Fix it by updating the PMU callback function power_pmu_disable() to
      check for a pending perf interrupt. If there is an overflown PMC and a
      pending perf interrupt indicated in the paca, clear the PMI bit in the
      paca to drop that sample. Clearing of the PMI bit is done in
      power_pmu_disable() since disable is invoked before any event gets
      deleted/stopped. With this fix, if more than one event is running in
      the PMU, there is a chance that we clear the PMI bit for an event which
      is not getting deleted/stopped. The other events may still remain
      active. Hence, to make sure we don't drop a valid sample in such cases,
      another check is added in power_pmu_enable(). This checks if there is
      an overflown PMC among the active events and, if so, sets the PMI bit
      again. Two new helper functions are introduced to clear/set the PMI,
      i.e. clear_pmi_irq_pending() and set_pmi_irq_pending(). The helper
      function pmi_irq_pending() is introduced to give a warning if there is
      a pending PMI bit in the paca, but no PMC is overflown.
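
The disable/enable bookkeeping described above can be modelled in a few lines of userspace C. This is only a sketch: `struct pmu_model` and the `model_*` functions are hypothetical, the `pmi_pending` flag stands in for the PMI bit in the paca, and the overflow test assumes that a counter counts as overflown once its top bit is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_PMCS 6	/* hypothetical number of counters */

struct pmu_model {
	bool pmi_pending;	/* stands in for the PMI bit in the paca */
	bool active[NR_PMCS];	/* which counters back a live event */
	uint32_t pmc[NR_PMCS];	/* counter values */
};

/* Assumed convention: a counter has overflown once its top bit is set. */
bool pmc_overflown(uint32_t val)
{
	return (val & 0x80000000u) != 0;
}

bool any_active_overflown(const struct pmu_model *m)
{
	for (int i = 0; i < NR_PMCS; i++)
		if (m->active[i] && pmc_overflown(m->pmc[i]))
			return true;
	return false;
}

/* Runs before any event is deleted or stopped: drop the pending sample. */
void model_pmu_disable(struct pmu_model *m)
{
	if (m->pmi_pending && any_active_overflown(m))
		m->pmi_pending = false;
}

/* Deleting an event deactivates its counter and resets it. */
void model_event_del(struct pmu_model *m, int idx)
{
	assert(idx >= 0 && idx < NR_PMCS);
	m->active[idx] = false;
	m->pmc[idx] = 0;
}

/* Runs afterwards: restore the pending bit if a live event still needs it. */
void model_pmu_enable(struct pmu_model *m)
{
	if (any_active_overflown(m))
		m->pmi_pending = true;
}
```

If two events have overflown and only one is deleted, the enable step sets the pending flag again so the surviving event's sample is not lost.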
      
      Also there are corner cases which result in performance monitor
      interrupts being triggered during power_pmu_disable(). This happens
      since the PMXE bit is not cleared along with disabling of other MMCR0
      bits in pmu_disable. Such PMIs could leave the PMU running and could
      trigger a PMI again, which will set the MMCR0 PMAO bit. This could lead
      to spurious interrupts in some corner cases. For example, a timer after
      power_pmu_del() will re-enable interrupts and trigger a PMI again
      since the PMAO bit is still set, but the handler fails to find a valid
      overflow since the PMC was cleared in power_pmu_del(). Fix that by
      disabling PMXE along with disabling of other MMCR0 bits in
      power_pmu_disable().
      
      We can't just replay the PMI at any time. Hence this approach is
      preferred over replaying the PMI before resetting the overflown PMC.
      The patch also documents, in core-book3s, a race condition which can
      trigger these PMC messages during the idle path on PowerNV.
      
      Fixes: f442d004 ("powerpc/64s: Add support to mask perf interrupts and replay them")
      Reported-by: Nageswara R Sastry <nasastry@in.ibm.com>
      Suggested-by: Nicholas Piggin <npiggin@gmail.com>
      Suggested-by: Madhavan Srinivasan <maddy@linux.ibm.com>
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Tested-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Make pmi_irq_pending() return bool, reflow/reword some comments]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/1626846509-1350-2-git-send-email-atrajeev@linux.vnet.ibm.com
    • powerpc/atomics: Remove atomic_inc()/atomic_dec() and friends · f05cab00
      Christophe Leroy authored
      Now that atomic_add() and atomic_sub() handle immediate operands,
      atomic_inc() and atomic_dec() have no added value compared to the
      generic fallback which calls atomic_add(1) and atomic_sub(1).
      
      Also remove atomic_inc_not_zero(), which falls back to
      atomic_add_unless(), which itself falls back to
      atomic_fetch_add_unless(), which now handles immediate operands.
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/0bc64a2f18726055093dbb2e479cefc60a409cfd.1632236981.git.christophe.leroy@csgroup.eu
    • powerpc/atomics: Use immediate operand when possible · 41d65207
      Christophe Leroy authored
      Today we get the following code generation for atomic operations:
      
      	c001bb2c:	39 20 00 01 	li      r9,1
      	c001bb30:	7d 40 18 28 	lwarx   r10,0,r3
      	c001bb34:	7d 09 50 50 	subf    r8,r9,r10
      	c001bb38:	7d 00 19 2d 	stwcx.  r8,0,r3
      
      	c001c7a8:	39 40 00 01 	li      r10,1
      	c001c7ac:	7d 00 18 28 	lwarx   r8,0,r3
      	c001c7b0:	7c ea 42 14 	add     r7,r10,r8
      	c001c7b4:	7c e0 19 2d 	stwcx.  r7,0,r3
      
      By allowing GCC to choose between immediate or regular operation,
      we get:
      
      	c001bb2c:	7d 20 18 28 	lwarx   r9,0,r3
      	c001bb30:	39 49 ff ff 	addi    r10,r9,-1
      	c001bb34:	7d 40 19 2d 	stwcx.  r10,0,r3
      	--
      	c001c7a4:	7d 40 18 28 	lwarx   r10,0,r3
      	c001c7a8:	39 0a 00 01 	addi    r8,r10,1
      	c001c7ac:	7d 00 19 2d 	stwcx.  r8,0,r3
      
      For "and", the dot form has to be used because "andi" doesn't exist.
      
      For logical operations we use an unsigned 16-bit immediate.
      For arithmetic operations we use a signed 16-bit immediate.
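
The two immediate ranges can be written down as simple predicates. The sketch below only models the range check, not how the compiler is told about it; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Arithmetic forms such as addi take a signed 16-bit immediate. */
bool fits_arith_imm(int64_t v)
{
	return v >= -32768 && v <= 32767;
}

/* Logical forms such as ori take an unsigned 16-bit immediate. */
bool fits_logical_imm(int64_t v)
{
	return v >= 0 && v <= 65535;
}

/* Value as it would sit in the 16-bit immediate field of an addi. */
int16_t encode_arith_imm(int64_t v)
{
	assert(fits_arith_imm(v));
	return (int16_t)v;
}
```

For example, -1 fits the arithmetic form and encodes as 0xffff, matching the `39 49 ff ff` bytes of the `addi r10,r9,-1` shown above, while 40000 fits only the logical form.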
      
      On pmac32_defconfig, it reduces the text by approx another 8 kbytes.
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Segher Boessenkool <segher@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/2ec558d44db8045752fe9dbd29c9ba84bab6030b.1632236981.git.christophe.leroy@csgroup.eu
    • powerpc/bitops: Use immediate operand when possible · fb350784
      Christophe Leroy authored
      Today we get the following code generation for bitops like
      set or clear bit:
      
      	c0009fe0:	39 40 08 00 	li      r10,2048
      	c0009fe4:	7c e0 40 28 	lwarx   r7,0,r8
      	c0009fe8:	7c e7 53 78 	or      r7,r7,r10
      	c0009fec:	7c e0 41 2d 	stwcx.  r7,0,r8
      
      	c000d568:	39 00 18 00 	li      r8,6144
      	c000d56c:	7c c0 38 28 	lwarx   r6,0,r7
      	c000d570:	7c c6 40 78 	andc    r6,r6,r8
      	c000d574:	7c c0 39 2d 	stwcx.  r6,0,r7
      
      Most set bits are constants in the lower 16 bits, so the operation can
      easily be replaced by its "immediate" version. Allow GCC to choose
      between the normal or immediate form.
      
      For clear bits, on 32 bits 'rlwinm' can be used instead of 'andc' when
      all bits to be cleared are consecutive.
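
As an illustration of the "consecutive bits" condition, the following userspace sketch checks whether a clear mask is one contiguous run and, if so, derives the MB/ME fields of an `rlwinm` that keeps every other bit. The function name is hypothetical, wrapped-around runs are deliberately not handled, and this is not the check the patch itself uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * If clear_mask is a single run of consecutive bits (neither empty nor all
 * 32 bits), compute the rlwinm MB/ME fields that keep every other bit.
 */
bool rlwinm_fields_for_clear(uint32_t clear_mask, unsigned *mb, unsigned *me)
{
	unsigned lo = 0, hi;
	uint32_t run;

	assert(mb && me);
	if (clear_mask == 0 || clear_mask == 0xffffffffu)
		return false;

	/* lo: position of the lowest set bit, counted from the LSB. */
	while (!((clear_mask >> lo) & 1))
		lo++;

	/* A contiguous run shifted down to bit 0 looks like 0..01..1. */
	run = clear_mask >> lo;
	if (run & (run + 1))
		return false;

	/* hi: position of the highest set bit of the run. */
	hi = lo;
	while (hi < 31 && ((clear_mask >> (hi + 1)) & 1))
		hi++;

	/* PowerPC numbers bits from the most significant end (bit 0 = MSB). */
	*mb = (32 - lo) % 32;	/* kept mask starts just below the run */
	*me = (62 - hi) % 32;	/* and wraps around to end just above it */
	return true;
}
```

For the clear mask 6144 (0x1800) from the listing above this yields MB=21 and ME=18, the same fields as in `rlwinm r7,r7,0,21,18`.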
      
      On 64 bits we don't have any equivalent single operation for clearing
      single bits or a few bits. We'd need two 'rldicl', so it is not worth
      it; the li/andc sequence does the same.
      
      With this patch we get:
      
      	c0009fe0:	7d 00 50 28 	lwarx   r8,0,r10
      	c0009fe4:	61 08 08 00 	ori     r8,r8,2048
      	c0009fe8:	7d 00 51 2d 	stwcx.  r8,0,r10
      
      	c000d558:	7c e0 40 28 	lwarx   r7,0,r8
      	c000d55c:	54 e7 05 64 	rlwinm  r7,r7,0,21,18
      	c000d560:	7c e0 41 2d 	stwcx.  r7,0,r8
      
      On pmac32_defconfig, it reduces the text by approx 10 kbytes.
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/e6f815d9181bab09df3b350af51149437863e9f9.1632236981.git.christophe.leroy@csgroup.eu
  4. 29 Nov, 2021 16 commits
  5. 25 Nov, 2021 9 commits
    • powerpc/watchdog: Fix wd_smp_last_reset_tb reporting · 3d030e30
      Nicholas Piggin authored
      wd_smp_last_reset_tb now gets reset by watchdog_smp_panic() as part of
      marking CPUs stuck and removing them from the pending mask before it
      begins any printing. This causes last reset times reported to be off.
      
      Fix this by reading it into a local variable before it gets reset.
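
The idea of snapshotting the value before it is overwritten can be shown with a small userspace sketch. `struct wd_state`, `mark_stuck_and_reset()` and `panic_report_elapsed()` are hypothetical stand-ins, not the watchdog's real data structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical watchdog state holding the last reset timestamp. */
struct wd_state {
	uint64_t last_reset_tb;
};

/* Marking CPUs stuck also moves the last-reset timestamp to "now". */
void mark_stuck_and_reset(struct wd_state *wd, uint64_t now)
{
	wd->last_reset_tb = now;
}

/* Returns the time since the last reset as it should be reported. */
uint64_t panic_report_elapsed(struct wd_state *wd, uint64_t now)
{
	uint64_t last_reset;

	assert(wd);
	/* Take the snapshot first, because the next call overwrites it. */
	last_reset = wd->last_reset_tb;
	mark_stuck_and_reset(wd, now);
	return now - last_reset;
}
```

Reading `wd->last_reset_tb` after the reset instead would always report an elapsed time of 0.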
      
      Fixes: 76521c4b ("powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211125103346.1188958-1-npiggin@gmail.com
    • Michael Ellerman's avatar
      4afc78ea
    • powerpc/watchdog: read TB close to where it is used · 1f01bf90
      Nicholas Piggin authored
      When taking watchdog actions, printing messages, comparing and
      re-setting wd_smp_last_reset_tb, etc., read TB close to the point of use
      and under wd_smp_lock or printing lock (if applicable).
      
      This should keep timebase mostly monotonic with kernel log messages, and
      could prevent (in theory) a laggy CPU updating wd_smp_last_reset_tb to
      something a long way in the past, and causing other CPUs to appear to be
      stuck.
      
      These additional TB reads are all slowpath (lockup has been detected),
      so performance does not matter.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211110025056.2084347-5-npiggin@gmail.com
    • powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi · 76521c4b
      Nicholas Piggin authored
      There is a deadlock with the console_owner lock and the wd_smp_lock:
      
      CPU x takes the console_owner lock
      CPU y takes a watchdog timer interrupt and takes __wd_smp_lock
      CPU x takes a soft-NMI interrupt, detects deadlock, spins on __wd_smp_lock
      CPU y detects deadlock, tries to print something and spins on console_owner
      -> deadlock
      
      Change the watchdog locking scheme so wd_smp_lock protects the watchdog
      internal data, but "reporting" (printing, issuing NMI IPIs, taking any
      action outside of watchdog) uses a non-waiting exclusion. If a CPU detects
      a problem but can not take the reporting lock, it just returns because
      something else is already reporting. It will try again at some point.
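
A non-waiting exclusion of this kind can be sketched with C11 atomics. This is a userspace illustration under assumptions, not the watchdog code: the flag and function names are hypothetical, and the counter stands in for the actual reporting work.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical "someone is reporting" flag; false when nobody is. */
atomic_bool wd_reporting_busy;

/* Counts how many reports were actually produced. */
int reports_made;

/* Try to become the reporter; never waits. */
bool try_start_reporting(void)
{
	return !atomic_exchange_explicit(&wd_reporting_busy, true,
					 memory_order_acquire);
}

/* Give up the reporter role; only the current reporter may call this. */
void finish_reporting(void)
{
	bool was_busy = atomic_exchange_explicit(&wd_reporting_busy, false,
						 memory_order_release);

	assert(was_busy);
	(void)was_busy;
}

/* Returns true if this caller did the reporting. */
bool report_problem(void)
{
	if (!try_start_reporting())
		return false;	/* someone else is reporting; try again later */

	reports_made++;		/* printing or sending IPIs would happen here */
	finish_reporting();
	return true;
}
```

Because a losing caller returns instead of spinning, it can never end up waiting on the reporter while the reporter waits on something the caller holds.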
      
      Typically the usefulness of hard lockup watchdog reports is not impacted
      by a failure to spew a large enough amount of data in as short a time
      as possible, but by messages getting garbled.
      
      Laurent debugged this and found the deadlock, and this patch is based on
      his general approach to avoid expensive operations while holding the
      lock, with the addition of the reporting exclusion.
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      [np: rework to add reporting exclusion update changelog]
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211110025056.2084347-4-npiggin@gmail.com
    • powerpc/watchdog: tighten non-atomic read-modify-write access · 858c93c3
      Nicholas Piggin authored
      Most updates to wd_smp_cpus_pending are under lock except the watchdog
      interrupt bit clear.
      
      This can race with non-atomic RMW updates to the mask under lock, which
      can happen in two instances:
      
      Firstly, if another CPU detects this one is stuck, removes it from the
      mask, mask becomes empty and is re-filled with non-atomic stores. This
      is okay because it would re-fill the mask with this CPU's bit clear
      anyway (because this CPU is now stuck), so it doesn't matter that the
      bit clear update got "lost". Add a comment for this.
      
      Secondly, if another CPU detects a different CPU is stuck and removes it
      from the pending mask with a non-atomic store to bytes which also
      include the bit of this CPU. This case can result in the bit clear being
      lost and the end result being the bit is set. This should be so rare it
      hardly matters, but to make things simpler to reason about just avoid
      the non-atomic access for that case.
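
A minimal sketch of clearing one CPU's bit without a non-atomic read-modify-write, using C11 atomics in userspace. The 64-bit `pending_mask` and `pending_clear_cpu()` are hypothetical simplifications of a cpumask, and this is not necessarily how the patch itself avoids the non-atomic access.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical pending mask covering up to 64 CPUs. */
_Atomic uint64_t pending_mask;

/*
 * Clear one CPU's bit as a single atomic read-modify-write, so a concurrent
 * clear of a neighbouring bit in the same word cannot be lost.
 */
void pending_clear_cpu(unsigned cpu)
{
	assert(cpu < 64);
	atomic_fetch_and(&pending_mask, ~(UINT64_C(1) << cpu));
}
```

With a plain load, mask and store, two CPUs clearing different bits of the same word could each write back a value that still has the other's bit set.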
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211110025056.2084347-3-npiggin@gmail.com
    • powerpc/watchdog: Fix missed watchdog reset due to memory ordering race · 5dad4ba6
      Nicholas Piggin authored
      It is possible for all CPUs to miss the pending cpumask becoming clear,
      and then for nobody to reset it, which will cause the lockup detector to
      stop working. It will eventually expire, but watchdog_smp_panic() will
      avoid doing anything if the pending mask is clear and it will never be
      reset.
      
      Order the cpumask clear vs the subsequent test to close this race.
      
      Add an extra check for an empty pending mask when the watchdog fires and
      finds its bit still clear, to try to catch any other possible races or
      bugs here and keep the watchdog working. The extra test in
      arch_touch_nmi_watchdog is required to prevent the new warning from
      firing off.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Debugged-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211110025056.2084347-2-npiggin@gmail.com
    • powerpc/prom_init: Fix improper check of prom_getprop() · 869fb7e5
      Peiwei Hu authored
      prom_getprop() can return PROM_ERROR. A plain truth-value test of the
      return value cannot identify it.
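
Why a truth-value test misses an error sentinel can be shown with a small userspace sketch. Everything here is hypothetical: `lookup_prop()` stands in for a property lookup, `LOOKUP_ERROR` stands in for PROM_ERROR and is assumed to be a nonzero sentinel, and the range check is just one possible way to cover it.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LOOKUP_ERROR (-1)	/* hypothetical stand-in for PROM_ERROR */

/* Hypothetical lookup: property length, or LOOKUP_ERROR if it is missing. */
int lookup_prop(const char *name)
{
	assert(name);
	if (strcmp(name, "model") == 0)
		return 6;
	if (strcmp(name, "empty") == 0)
		return 0;
	return LOOKUP_ERROR;
}

/* Truth-value test: treats the nonzero error value as "found". */
bool prop_found_naive(const char *name)
{
	return lookup_prop(name) != 0;
}

/* Range test: rejects both the error value and an empty property. */
bool prop_found_checked(const char *name)
{
	return lookup_prop(name) > 0;
}
```

For a missing property the naive test reports "found", while the range test does not.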
      
      Fixes: 94d2dde7 ("[POWERPC] Efika: prune fixups and make them more carefull")
      Signed-off-by: Peiwei Hu <jlu.hpw@foxmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/tencent_BA28CC6897B7C95A92EB8C580B5D18589105@qq.com
    • powerpc/rtas: rtas_busy_delay_time() kernel-doc · dd5cde45
      Nathan Lynch authored
      Provide API documentation for rtas_busy_delay_time(), explaining why we
      return the same value for 9900 and -2.
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211117060259.957178-3-nathanl@linux.ibm.com
    • powerpc/rtas: rtas_busy_delay() improvements · 38f7b706
      Nathan Lynch authored
      Generally RTAS cannot block, and in PAPR it is required to return control
      to the OS within a few tens of microseconds. In order to support operations
      which may take longer to complete, many RTAS primitives can return
      intermediate -2 ("busy") or 990x ("extended delay") values, which indicate
      that the OS should reattempt the same call with the same arguments at some
      point in the future.
      
      Current versions of PAPR are less than clear about this, but the intended
      meanings of these values in more detail are:
      
      RTAS_BUSY (-2): RTAS has suspended a potentially long-running operation in
      order to meet its latency obligation and give the OS the opportunity to
      perform other work. RTAS can resume making progress as soon as the OS
      reattempts the call.
      
      RTAS_EXTENDED_DELAY_{MIN...MAX} (9900-9905): RTAS must wait for an external
      event to occur or for internal contention to resolve before it can complete
      the requested operation. The value encodes a non-binding hint as to roughly
      how long the OS should wait before calling again, but the OS is allowed to
      reattempt the call sooner or even immediately.
      
      Linux of course must take its own CPU scheduling obligations into account
      when handling these statuses; e.g. a task which receives an RTAS_BUSY
      status should check whether to reschedule before it attempts the RTAS call
      again to avoid starving other tasks.
      
      rtas_busy_delay() is a helper function that "consumes" a busy or extended
      delay status. Common usage:
      
          int rc;
      
          do {
              rc = rtas_call(rtas_token("some-function"), ...);
          } while (rtas_busy_delay(rc));
      
          /* convert rc to Linux error value, etc */
      
      If rc is a busy or extended delay status, the caller can rely on
      rtas_busy_delay() to perform an appropriate sleep or reschedule and return
      nonzero. Other statuses are handled normally by the caller.
      
      The current implementation of rtas_busy_delay() both oversleeps and
      overuses the CPU:
      
      *  It performs msleep() for all 990x and even when no delay is
         suggested (-2), but this is understood to actually sleep for two jiffies
         minimum in practice (20ms with HZ=100). 9900 (1ms) and 9901 (10ms)
         appear to be the most common extended delay statuses, and the
         oversleeping measurably lengthens DLPAR operations, which perform
         many RTAS calls.
      
      *  It does not sleep on 990x unless need_resched() is true, causing code
         like the loop above to needlessly retry, wasting CPU time.
      
      Alter the logic to align better with the intended meanings:
      
      *  When passed RTAS_BUSY, perform cond_resched() and return without
         sleeping. The caller should reattempt immediately.
      
      *  Always sleep when passed an extended delay status, using usleep_range()
         for precise shorter sleeps. Limit the sleep time to one second even
         though there are higher architected values.
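
The altered logic can be summarised as a pure decision function. This is a userspace sketch, not the kernel implementation: `busy_delay_plan()` and `MAX_SLEEP_MS` are hypothetical, and the encoding "9900 + n hints at 10^n milliseconds" is an assumption extrapolated from the 9900 (1ms) and 9901 (10ms) values mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

#define RTAS_BUSY			(-2)
#define RTAS_EXTENDED_DELAY_MIN		9900
#define RTAS_EXTENDED_DELAY_MAX		9905
#define MAX_SLEEP_MS			1000u	/* cap sleeps at one second */

/*
 * Decide how to react to an RTAS status. Returns true when the caller
 * should retry the call; *sleep_ms is how long to sleep first, where 0
 * means "only check whether to reschedule".
 */
bool busy_delay_plan(int status, unsigned *sleep_ms)
{
	unsigned ms = 1;

	assert(sleep_ms);
	*sleep_ms = 0;

	if (status == RTAS_BUSY)
		return true;	/* no sleep; retry right away */

	if (status < RTAS_EXTENDED_DELAY_MIN ||
	    status > RTAS_EXTENDED_DELAY_MAX)
		return false;	/* not a busy status; caller handles it */

	/* Assumed encoding: 9900 + n hints at 10^n milliseconds. */
	for (int n = status - RTAS_EXTENDED_DELAY_MIN; n > 0; n--)
		ms *= 10;

	*sleep_ms = ms > MAX_SLEEP_MS ? MAX_SLEEP_MS : ms;
	return true;
}
```

Under these assumptions 9901 plans a 10ms sleep, 9905 is capped at 1000ms, and RTAS_BUSY plans no sleep at all.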
      
      Change rtas_busy_delay()'s return type to bool to better reflect its usage,
      and add kernel-doc.
      
      rtas_busy_delay_time() is unchanged, even though it "incorrectly" returns 1
      for RTAS_BUSY. There are users of that API with open-coded delay loops in
      sensitive contexts that will have to be taken on an individual basis.
      
      Brief results for addition and removal of 5GB memory on a small P9 PowerVM
      partition follow. Load was generated with stress-ng --cpu N. For add,
      elapsed time is greatly reduced without significant change in the number of
      RTAS calls or time spent on CPU. For remove, elapsed time is modestly
      reduced, with significant reductions in RTAS calls and time spent on CPU.
      
      With no competing workload (- before, + after):
      
        Performance counter stats for 'bash -c echo "memory add count 20" > /sys/kernel/dlpar' (10 runs):
      
      -             1,935      probe:rtas_call           #    0.003 M/sec                    ( +-  0.22% )
      -            609.99 msec task-clock                #    0.183 CPUs utilized            ( +-  0.19% )
      +             1,956      probe:rtas_call           #    0.003 M/sec                    ( +-  0.17% )
      +            618.56 msec task-clock                #    0.278 CPUs utilized            ( +-  0.11% )
      
      -            3.3322 +- 0.0670 seconds time elapsed  ( +-  2.01% )
      +            2.2222 +- 0.0416 seconds time elapsed  ( +-  1.87% )
      
        Performance counter stats for 'bash -c echo "memory remove count 20" > /sys/kernel/dlpar' (10 runs):
      
      -             6,224      probe:rtas_call           #    0.008 M/sec                    ( +-  2.57% )
      -            750.36 msec task-clock                #    0.190 CPUs utilized            ( +-  2.01% )
      +               843      probe:rtas_call           #    0.003 M/sec                    ( +-  0.12% )
      +            250.66 msec task-clock                #    0.068 CPUs utilized            ( +-  0.17% )
      
      -            3.9394 +- 0.0890 seconds time elapsed  ( +-  2.26% )
      +             3.678 +- 0.113 seconds time elapsed  ( +-  3.07% )
      
      With all CPUs 100% busy (- before, + after):
      
        Performance counter stats for 'bash -c echo "memory add count 20" > /sys/kernel/dlpar' (10 runs):
      
      -             2,979      probe:rtas_call           #    0.003 M/sec                    ( +-  0.12% )
      -          1,096.62 msec task-clock                #    0.105 CPUs utilized            ( +-  0.10% )
      +             2,981      probe:rtas_call           #    0.003 M/sec                    ( +-  0.22% )
      +          1,095.26 msec task-clock                #    0.154 CPUs utilized            ( +-  0.21% )
      
      -            10.476 +- 0.104 seconds time elapsed  ( +-  1.00% )
      +            7.1124 +- 0.0865 seconds time elapsed  ( +-  1.22% )
      
        Performance counter stats for 'bash -c echo "memory remove count 20" > /sys/kernel/dlpar' (10 runs):
      
      -             2,702      probe:rtas_call           #    0.004 M/sec                    ( +-  4.00% )
      -            722.71 msec task-clock                #    0.067 CPUs utilized            ( +-  2.41% )
      +             1,246      probe:rtas_call           #    0.003 M/sec                    ( +-  0.25% )
      +            487.73 msec task-clock                #    0.049 CPUs utilized            ( +-  0.20% )
      
      -            10.829 +- 0.163 seconds time elapsed  ( +-  1.51% )
      +            9.9887 +- 0.0866 seconds time elapsed  ( +-  0.87% )
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211117060259.957178-2-nathanl@linux.ibm.com