- 19 Jun, 2017 8 commits
-
-
Nicholas Piggin authored
In a busy system, idle wakeups can be expected from IPIs and device interrupts. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Idle code now always runs at the 0xc... effective address whether in real or virtual mode. This means rfid can be ditched, along with a lot of SRR manipulations. In the wakeup path, carry SRR1 around in r12. Use mtmsrd to change MSR states as required. This also balances the return prediction for the idle call, by doing blr rather than rfid to return to the idle caller. On POWER9, 2-process context switch on different cores, with snooze disabled, increases performance by 2%. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Incorporate v2 fixes from Nick] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Have the system reset idle wakeup handlers branched to in real mode with the 0xc... kernel address applied. This allows simplifications of avoiding rfid when switching to virtual mode in the wakeup handler. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
The __replay_interrupt() code is branched to with bl, but the caller is returned to directly with rfid from the interrupt. Instead, rfid to a stub that returns to the caller with blr, which should keep the return branch predictor balanced. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
msgsnd doorbell exceptions are cleared when the doorbell interrupt is taken. However if a doorbell exception causes a system reset interrupt wake from power saving state, the message is not cleared. Processing the doorbell from the system reset interrupt requires msgclr to avoid taking the exception again. Testing this plus the previous wakup direct patch gives: original wakeup direct msgclr Different threads, same core: 315k/s 264k/s 345k/s Different cores: 235k/s 242k/s 242k/s Net speedup is +10% for same core, and +3% for different core. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
When the CPU wakes from low power state, it begins at the system reset interrupt with the exception that caused the wakeup encoded in SRR1. Today, powernv idle wakeup ignores the wakeup reason (except a special case for HMI), and the regular interrupt corresponding to the exception will fire after the idle wakeup exits. Change this to replay the interrupt from the idle wakeup before interrupts are hard-enabled. Test on POWER8 of context_switch selftests benchmark with polling idle disabled (e.g., always nap, giving cross-CPU IPIs) gives the following results: original wakeup direct Different threads, same core: 315k/s 264k/s Different cores: 235k/s 242k/s There is a slowdown for doorbell IPI (same core) case because system reset wakeup does not clear the message and the doorbell interrupt fires again needlessly. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Rather than concern ourselves with any soft-mask logic in the CPU hotplug handler, just hard disable interrupts. This ensures there are no lazy-irqs pending, which means we can call directly to idle instruction in order to sleep. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
This simplifies the asm and fixes irq-off tracing over sleep instructions. Also move powersave_nap check for POWER8 into C code, and move PSSCR register value calculation for POWER9 into C. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 15 Jun, 2017 9 commits
-
-
Murilo Opsfelder Araujo authored
drivers/watchdog/wdrtas.c uses symbols defined in arch/powerpc/kernel/rtas.c, which are exported iff CONFIG_PPC_RTAS is selected. Building wdrtas.c without setting CONFIG_PPC_RTAS throws the following errors: ERROR: ".rtas_token" [drivers/watchdog/wdrtas.ko] undefined! ERROR: "rtas_data_buf" [drivers/watchdog/wdrtas.ko] undefined! ERROR: "rtas_data_buf_lock" [drivers/watchdog/wdrtas.ko] undefined! ERROR: ".rtas_get_sensor" [drivers/watchdog/wdrtas.ko] undefined! ERROR: ".rtas_call" [drivers/watchdog/wdrtas.ko] undefined! This was identified during a randconfig build where CONFIG_WATCHDOG_RTAS=m and CONFIG_PPC_RTAS was not set. Logs are here: http://kisskb.ellerman.id.au/kisskb/buildresult/12982152/ This patch fixes the issue by updating CONFIG_WATCHDOG_RTAS to depend on just CONFIG_PPC_RTAS, removing COMPILE_TEST entirely. Signed-off-by: Murilo Opsfelder Araujo <mopsfelder@gmail.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
The ISA v3.0B copy-paste facility only requires cpabort when switching to a process that has foreign real addresses mapped (direct access to accelerators), to clear a potential copy buffer filled by a previous thread. There is no accelerator driver implemented yet, so cpabort can be removed. It can be be re-added when a driver is implemented. POWER9 DD1 requires the copy buffer to always be cleared on context switch, but if accelerators are not in use, then an unpaired copy from a dummy region is sufficient to clear data out of the copy buffer. This increases context switch performance by about 5% on POWER9. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
The sync (aka. hwsync, aka. heavyweight sync) in the context switch code to prevent MMIO access being reordered from the point of view of a single process if it gets migrated to a different CPU is not required because there is an hwsync performed earlier in the context switch path. Comment this so it's clear enough if anything changes on the scheduler or the powerpc sides. Remove the hwsync from _switch. This improves context switch performance by 2-3% on POWER8. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
There is no need to explicitly break the reservation in _switch, because we are guaranteed that the context switch path will include a larx/stcx. Comment the guarantee and remove the reservation clear from _switch. This is worth 1-2% in context switch performance. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Commit 4387e9ff25 ("[POWERPC] Fix PMU + soft interrupt disable bug") hard disabled interrupts over the low level context switch, because the SLB management can't cope with a PMU interrupt accesing the stack in that window. Radix based kernel mapping does not use the SLB so it does not require interrupts hard disabled here. This is worth 1-2% in context switch performance on POWER9. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
The syscall exit code that branches to restore_math is quite heavy on Book3S, consisting of 2 mtmsr instructions. Threads that don't use both FP and vector can get caught here if the kernel ever uses FP or vector. Lazy-FP/vec context switching also trips this case. So check for lazy FP and vector before switching RI for restore_math. Move most of this case out of line. For threads that do want to restore math registers, the MSR switches are still suboptimal. Future direction may be to use a soft-RI bit to avoid MSR switches in kernel (similar to soft-EE), but for now at least the no-restore POWER9 context switch rate increases by about 5% due to sched_yield(2) return performance. I haven't constructed a test to measure the syscall cost. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
After bc355125 ("powerpc/64: Allow for relocation-on interrupts from guest to host"), a getppid() system call goes from 307 cycles to 358 cycles (+17%) on POWER8. This is due significantly to the scratch SPR used by the hypercall check. It turns out there are a some volatile registers common to both system call and hypercall (in particular, r12, cr0, ctr), which can be used to avoid the SPR and some other overheads. This brings getppid to 320 cycles (+4%). Testing hcall entry performance by running "sc 1" in guest userspace before this patch is 854 cycles, afterwards is 826. Also a small win there. POWER9 syscall is improved by about the same amount, hcall not tested. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
Currently we map the whole linear mapping with PAGE_KERNEL_X. Instead we should check if the page overlaps the kernel text and only then add PAGE_KERNEL_X. Note that we still use 1G pages if they're available, so this will typically still result in a 1G executable page at KERNELBASE. So this fix is primarily useful for catching stray branches to high linear mapping addresses. Without this patch, we can execute at 1G in xmon using: 0:mon> m c000000040000000 c000000040000000 00 l c000000040000000 00000000 01006038 c000000040000004 00000000 2000804e c000000040000008 00000000 x 0:mon> di c000000040000000 c000000040000000 38600001 li r3,1 c000000040000004 4e800020 blr 0:mon> p c000000040000000 return value is 0x1 After we get a 400 as expected: 0:mon> p c000000040000000 *** 400 exception occurred Fixes: 2bfd65e4 ("powerpc/mm/radix: Add radix callbacks for early init routines") Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com>
-
Michael Ellerman authored
This reverts commit 45cb08f4. For some reason this is causing IRQ problems on Freescale Book3E machines, eg on my p5020ds: irq 25: nobody cared (try booting with the "irqpoll" option) CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc3-gcc-6.3.1-00037-g45cb08f4 #624 Call Trace: [c0000000fffdbb10] [c00000000049962c] .dump_stack+0xa8/0xe8 (unreliable) [c0000000fffdbba0] [c0000000000babf4] .__report_bad_irq+0x54/0x140 [c0000000fffdbc40] [c0000000000bb11c] .note_interrupt+0x324/0x380 [c0000000fffdbd00] [c0000000000b7110] .handle_irq_event_percpu+0x68/0x88 [c0000000fffdbd90] [c0000000000b718c] .handle_irq_event+0x5c/0xa8 [c0000000fffdbe10] [c0000000000bc01c] .handle_fasteoi_irq+0xe4/0x298 [c0000000fffdbe90] [c0000000000b59c4] .generic_handle_irq+0x50/0x74 [c0000000fffdbf10] [c0000000000075d8] .__do_irq+0x74/0x1f0 [c0000000fffdbf90] [c0000000000189f8] .call_do_irq+0x14/0x24 [c0000000f7173060] [c0000000000077e4] .do_IRQ+0x90/0x120 [c0000000f7173100] [c00000000001d93c] exc_0x500_common+0xfc/0x100 --- interrupt: 501 at .prepare_to_wait_event+0xc/0x14c LR = .fsl_elbc_run_command+0xc8/0x23c [c0000000f71734d0] [c00000000065f418] .nand_reset+0xb8/0x168 [c0000000f7173560] [c00000000065fec4] .nand_scan_ident+0x2b0/0x1638 [c0000000f7173650] [c000000000666cd8] .fsl_elbc_nand_probe+0x34c/0x5f0 ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) [c0000000f7173750] [c0000000005a3c60] .platform_drv_probe+0x64/0xb0 [c0000000f71737d0] [c0000000005a12e0] .really_probe+0x290/0x334 [c0000000f7173870] [c0000000005a14a0] .__driver_attach+0x11c/0x120 [c0000000f7173900] [c00000000059e6a0] .bus_for_each_dev+0x98/0xfc [c0000000f71739a0] [c0000000005a0b3c] .driver_attach+0x34/0x4c [c0000000f7173a20] [c0000000005a04b0] .bus_add_driver+0x1ac/0x2e0 [c0000000f7173ac0] [c0000000005a2170] .driver_register+0x94/0x160 [c0000000f7173b40] [c0000000005a3be0] .__platform_driver_register+0x60/0x7c [c0000000f7173bc0] [c000000000d6aab4] .fsl_elbc_nand_driver_init+0x24/0x38 [c0000000f7173c30] [c000000000001934] .do_one_initcall+0x68/0x1b8 [c0000000f7173d00] [c000000000d210f8] .kernel_init_freeable+0x260/0x338 [c0000000f7173db0] [c0000000000021b0] .kernel_init+0x20/0xe70 [c0000000f7173e30] [c0000000000009bc] .ret_from_kernel_thread+0x58/0x9c handlers: [<c000000000ed85c8>] .fsl_lbc_ctrl_irq Disabling IRQ #25 Ben also had concerns with the implementation being potentially slow on some PICs, so revert it for now. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 06 Jun, 2017 2 commits
-
-
Nicholas Piggin authored
The i-side 0111b machine check, which is "Instruction Fetch to foreign address space", was missed by 7b9f71f9 ("powerpc/64s: POWER9 machine check handler"). The POWER9 processor core considers host real addresses with a nonzero value in RA(8:12) as foreign address space, accessible only by the copy and paste instructions. The copy and paste instruction pair can be used to invoke the Nest accelerators via the Virtual Accelerator Switchboard (VAS). It is an error for any regular load/store or ifetch to go to a foreign addresses. When relocation is on, this causes an MMU exception. When relocation is off, a machine check exception. It is possible to trigger this machine check by branching to a foreign address with MSR[IR]=0. Fixes: 7b9f71f9 ("powerpc/64s: POWER9 machine check handler") Reported-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Dan Carpenter authored
We should unlock if get_cxl_adapter() fails. Fixes: 594ff7d0 ("cxl: Support to flash a new image on the adapter from a guest") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 05 Jun, 2017 9 commits
-
-
Christophe Leroy authored
These two functions implement the same semantics, so unify their naming so we can share code that calls them. The longer name is more descriptive so use it. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Balbir Singh authored
Add __GFP_ACCOUNT to __hugepte_alloc() Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Balbir Singh authored
Add support in pte_alloc_one() and pgd_alloc() by passing __GFP_ACCOUNT in the flags Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Balbir Singh authored
Introduce a helper pgtable_gfp_flags() which just returns the current gfp flags and adds __GFP_ACCOUNT to account for page table allocation. The generic helper is added to include/asm/pgalloc.h and has two variants - WARNING ugly bits ahead 1. If the header is included from a module, no check for mm == &init_mm is done, since init_mm is not exported 2. For kernel includes, the check is done and required see (3e79ec7d arch: x86: charge page tables to kmemcg) The fundamental assumption is that no module should be doing pgd/pud/pmd and pte alloc's on behalf of init_mm directly. NOTE: This adds an overhead to pmd/pud/pgd allocations similar to x86. The other alternative was to implement pmd_alloc_kernel/pud_alloc_kernel and pgd_alloc_kernel with their offset variants. For 4k page size, pte_alloc_one no longer calls pte_alloc_one_kernel. Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Balbir Singh authored
Currently in hpte_need_flush() if there is no batch pending we always do a global TLB flush, which is inefficient if the mm has never run on another thread. Instead do the same check that __flush_tlb_pending() does and check if a local flush is sufficient when batch->active is false. Instead of open-coding it we use mm_is_thread_local(). Signed-off-by: Balbir Singh <bsingharora@gmail.com> [mpe: Don't use a local, just inline mm_is_thread_local()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Yang Li authored
Signed-off-by: Li Yang <leoyang.li@nxp.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Yang Li authored
Add myself as the maintainer for drivers/fsl/soc/ and fix the scope for device tree bindings. Signed-off-by: Li Yang <leoyang.li@nxp.com> Acked-by: Scott Wood <oss@buserror.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
This reduces overhead of mutex locking and increases context switch rate significantly (which helps to measure and profile the context switch path). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Colin Ian King authored
Collation of some spelling fixes from Colin. Attemping -> Attempting intialized -> initialized missmanaged -> mismanaged Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 02 Jun, 2017 12 commits
-
-
Matt Brown authored
The xor_vmx.c file is used for the RAID5 xor operations. In these functions altivec is enabled to run the operation and then disabled. The code uses enable_kernel_altivec() around the core of the algorithm, however the whole file is built with -maltivec, so the compiler is within its rights to generate altivec code anywhere. This has been seen at least once in the wild: 0:mon> di $xor_altivec_2 c0000000000b97d0 3c4c01d9 addis r2,r12,473 c0000000000b97d4 3842db30 addi r2,r2,-9424 c0000000000b97d8 7c0802a6 mflr r0 c0000000000b97dc f8010010 std r0,16(r1) c0000000000b97e0 60000000 nop c0000000000b97e4 7c0802a6 mflr r0 c0000000000b97e8 faa1ffa8 std r21,-88(r1) ... c0000000000b981c f821ff41 stdu r1,-192(r1) c0000000000b9820 7f8101ce stvx v28,r1,r0 <-- POP c0000000000b9824 38000030 li r0,48 c0000000000b9828 7fa101ce stvx v29,r1,r0 ... c0000000000b984c 4bf6a06d bl c0000000000238b8 # enable_kernel_altivec This patch splits the non-altivec code into xor_vmx_glue.c which calls the altivec functions in xor_vmx.c. By compiling xor_vmx_glue.c without -maltivec we can guarantee that altivec instruction will not be executed outside of the enable/disable block. Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com> [mpe: Rework change log and include disassembly] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Hari Bathini authored
By default, 5% of system RAM is reserved for preserving boot memory. Alternatively, a user can specify the amount of memory to reserve. See Documentation/powerpc/firmware-assisted-dump.txt for details. In addition to the memory reserved for preserving boot memory, some more memory is reserved, to save HPTE region, CPU state data and ELF core headers. Memory Reservation during first kernel looks like below: Low memory Top of memory 0 boot memory size | | | |<--Reserved dump area -->| V V | Permanent Reservation V +-----------+----------/ /----------+---+----+-----------+----+ | | |CPU|HPTE| DUMP |ELF | +-----------+----------/ /----------+---+----+-----------+----+ | ^ | | \ / ------------------------------------------- Boot memory content gets transferred to reserved area by firmware at the time of crash This implicitly means that the sum of the sizes of boot memory, CPU state data, HPTE region, DUMP preserving area and ELF core headers can't be greater than the total memory size. But currently, a user is allowed to specify any value as boot memory size. So, the above rule is violated when a boot memory size around 50% of the total available memory is specified. As the kernel is not handling this currently, it may lead to undefined behavior. Fix it by setting an upper limit for boot memory size to 25% of the total available memory. Also, instead of using memblock_end_of_DRAM(), which doesn't take the holes, if any, in the memory layout into account, use memblock_phys_mem_size() to calculate the percentage of total available memory. Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Hari Bathini authored
With commit f6e6bedb ("powerpc/fadump: Reserve memory at an offset closer to bottom of RAM"), memory for fadump is no longer reserved at the top of RAM. But there are still a few places which say so. Change them appropriately. Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Hari Bathini authored
With commit 11550dc0 ("powerpc/fadump: reuse crashkernel parameter for fadump memory reservation"), 'fadump_reserve_mem=' parameter is deprecated in favor of 'crashkernel=' parameter. Add a warning if 'fadump_reserve_mem=' is still used. Fixes: 11550dc0 ("powerpc/fadump: reuse crashkernel parameter for fadump memory reservation") Suggested-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> [mpe: Unsplit long printk strings] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michal Suchanek authored
- log an error message when registration fails and no error code listed in the switch is returned - translate the hv error code to posix error code and return it from fw_register - return the posix error code from fw_register to the process writing to sysfs - return EEXIST on re-registration - return success on deregistration when fadump is not registered - return ENODEV when no memory is reserved for fadump Signed-off-by: Michal Suchanek <msuchanek@suse.de> Tested-by: Hari Bathini <hbathini@linux.vnet.ibm.com> [mpe: Use pr_err() to shrink the error print] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
With the __ilog2() function as defined in arch/powerpc/include/asm/bitops.h, GCC will not optimise the code in case of constant parameter. The generic ilog2() function in include/linux/log2.h is written to handle the case of the constant parameter. This patch discards the three __ilog2() functions and defines __ilog2() as ilog2() For non constant calls, the generated code is doing the same: int test__ilog2(unsigned long x) { return __ilog2(x); } int test__ilog2_u32(u32 n) { return __ilog2_u32(n); } int test__ilog2_u64(u64 n) { return __ilog2_u64(n); } On PPC32 before the patch: 00000000 <test__ilog2>: 0: 7c 63 00 34 cntlzw r3,r3 4: 20 63 00 1f subfic r3,r3,31 8: 4e 80 00 20 blr 0000000c <test__ilog2_u32>: c: 7c 63 00 34 cntlzw r3,r3 10: 20 63 00 1f subfic r3,r3,31 14: 4e 80 00 20 blr On PPC32 after the patch: 00000000 <test__ilog2>: 0: 7c 63 00 34 cntlzw r3,r3 4: 20 63 00 1f subfic r3,r3,31 8: 4e 80 00 20 blr 0000000c <test__ilog2_u32>: c: 7c 63 00 34 cntlzw r3,r3 10: 20 63 00 1f subfic r3,r3,31 14: 4e 80 00 20 blr On PPC64 before the patch: 0000000000000000 <.test__ilog2>: 0: 7c 63 00 74 cntlzd r3,r3 4: 20 63 00 3f subfic r3,r3,63 8: 7c 63 07 b4 extsw r3,r3 c: 4e 80 00 20 blr 0000000000000010 <.test__ilog2_u32>: 10: 7c 63 00 34 cntlzw r3,r3 14: 20 63 00 1f subfic r3,r3,31 18: 7c 63 07 b4 extsw r3,r3 1c: 4e 80 00 20 blr 0000000000000020 <.test__ilog2_u64>: 20: 7c 63 00 74 cntlzd r3,r3 24: 20 63 00 3f subfic r3,r3,63 28: 7c 63 07 b4 extsw r3,r3 2c: 4e 80 00 20 blr On PPC64 after the patch: 0000000000000000 <.test__ilog2>: 0: 7c 63 00 74 cntlzd r3,r3 4: 20 63 00 3f subfic r3,r3,63 8: 7c 63 07 b4 extsw r3,r3 c: 4e 80 00 20 blr 0000000000000010 <.test__ilog2_u32>: 10: 7c 63 00 34 cntlzw r3,r3 14: 20 63 00 1f subfic r3,r3,31 18: 7c 63 07 b4 extsw r3,r3 1c: 4e 80 00 20 blr 0000000000000020 <.test__ilog2_u64>: 20: 7c 63 00 74 cntlzd r3,r3 24: 20 63 00 3f subfic r3,r3,63 28: 7c 63 07 b4 extsw r3,r3 2c: 4e 80 00 20 blr Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
With the ffz() function as defined in arch/powerpc/include/asm/bitops.h GCC will not optimise the code in case of constant parameter. This patch replaces ffz() by the generic function. The generic ffz(x) expects to never be called with ~x == 0 as written in the comment in include/asm-generic/bitops/ffz.h The only user of ffz() within arch/powerpc/ is platforms/512x/mpc5121_ads_cpld.c, which checks if x is not 0xff For non constant calls, the generated code is doing the same: unsigned long testffz(unsigned long x) { return ffz(x); } On PPC32, before the patch: 00000018 <testffz>: 18: 7c 63 18 f9 not. r3,r3 1c: 40 82 00 0c bne 28 <testffz+0x10> 20: 38 60 00 20 li r3,32 24: 4e 80 00 20 blr 28: 7d 23 00 d0 neg r9,r3 2c: 7d 23 18 38 and r3,r9,r3 30: 7c 63 00 34 cntlzw r3,r3 34: 20 63 00 1f subfic r3,r3,31 38: 4e 80 00 20 blr On PPC32, after the patch: 00000018 <testffz>: 18: 39 23 00 01 addi r9,r3,1 1c: 7d 23 18 78 andc r3,r9,r3 20: 7c 63 00 34 cntlzw r3,r3 24: 20 63 00 1f subfic r3,r3,31 28: 4e 80 00 20 blr On PPC64, before the patch: 0000000000000030 <.testffz>: 30: 7c 60 18 f9 not. r0,r3 34: 38 60 00 40 li r3,64 38: 4d 82 00 20 beqlr 3c: 7c 60 00 d0 neg r3,r0 40: 7c 63 00 38 and r3,r3,r0 44: 7c 63 00 74 cntlzd r3,r3 48: 20 63 00 3f subfic r3,r3,63 4c: 7c 63 07 b4 extsw r3,r3 50: 4e 80 00 20 blr On PPC64, after the patch: 0000000000000030 <.testffz>: 30: 38 03 00 01 addi r0,r3,1 34: 7c 03 18 78 andc r3,r0,r3 38: 7c 63 00 74 cntlzd r3,r3 3c: 20 63 00 3f subfic r3,r3,63 40: 4e 80 00 20 blr Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
With the fls() functions as defined in arch/powerpc/include/asm/bitops.h GCC will not optimise the code in case of constant parameter. This patch replaces __fls() by the builtin function, and modifies fls() and fls64() to use builtins instead of inline assembly For non constant calls, the generated code is doing the same: int testfls(unsigned int x) { return fls(x); } unsigned long test__fls(unsigned long x) { return __fls(x); } int testfls64(__u64 x) { return fls64(x); } On PPC32, before the patch: 00000064 <testfls>: 64: 7c 63 00 34 cntlzw r3,r3 68: 20 63 00 20 subfic r3,r3,32 6c: 4e 80 00 20 blr 00000070 <test__fls>: 70: 7c 63 00 34 cntlzw r3,r3 74: 20 63 00 1f subfic r3,r3,31 78: 4e 80 00 20 blr 0000007c <testfls64>: 7c: 2c 03 00 00 cmpwi r3,0 80: 40 82 00 10 bne 90 <testfls64+0x14> 84: 7c 83 00 34 cntlzw r3,r4 88: 20 63 00 20 subfic r3,r3,32 8c: 4e 80 00 20 blr 90: 7c 63 00 34 cntlzw r3,r3 94: 20 63 00 40 subfic r3,r3,64 98: 4e 80 00 20 blr On PPC32, after the patch: 00000054 <testfls>: 54: 7c 63 00 34 cntlzw r3,r3 58: 20 63 00 20 subfic r3,r3,32 5c: 4e 80 00 20 blr 00000060 <test__fls>: 60: 7c 63 00 34 cntlzw r3,r3 64: 20 63 00 1f subfic r3,r3,31 68: 4e 80 00 20 blr 0000006c <testfls64>: 6c: 2c 03 00 00 cmpwi r3,0 70: 41 82 00 10 beq 80 <testfls64+0x14> 74: 7c 63 00 34 cntlzw r3,r3 78: 20 63 00 40 subfic r3,r3,64 7c: 4e 80 00 20 blr 80: 7c 83 00 34 cntlzw r3,r4 84: 20 63 00 40 subfic r3,r3,32 88: 4e 80 00 20 blr On PPC64, before the patch: 00000000000000a0 <.testfls>: a0: 7c 63 00 34 cntlzw r3,r3 a4: 20 63 00 20 subfic r3,r3,32 a8: 7c 63 07 b4 extsw r3,r3 ac: 4e 80 00 20 blr 00000000000000b0 <.test__fls>: b0: 7c 63 00 74 cntlzd r3,r3 b4: 20 63 00 3f subfic r3,r3,63 b8: 7c 63 07 b4 extsw r3,r3 bc: 4e 80 00 20 blr 00000000000000c0 <.testfls64>: c0: 7c 63 00 74 cntlzd r3,r3 c4: 20 63 00 40 subfic r3,r3,64 c8: 7c 63 07 b4 extsw r3,r3 cc: 4e 80 00 20 blr On PPC64, after the patch: 0000000000000090 <.testfls>: 90: 7c 63 00 34 cntlzw r3,r3 94: 20 63 00 20 subfic r3,r3,32 98: 7c 63 07 b4 extsw r3,r3 9c: 4e 80 00 20 blr 00000000000000a0 <.test__fls>: a0: 7c 63 00 74 cntlzd r3,r3 a4: 20 63 00 3f subfic r3,r3,63 a8: 4e 80 00 20 blr ac: 60 00 00 00 nop 00000000000000b0 <.testfls64>: b0: 7c 63 00 74 cntlzd r3,r3 b4: 20 63 00 40 subfic r3,r3,64 b8: 7c 63 07 b4 extsw r3,r3 bc: 4e 80 00 20 blr Those builtins have been in GCC since at least 3.4.6 (see https://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Other-Builtins.html ) Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
With the ffs() function as defined in arch/powerpc/include/asm/bitops.h GCC will not optimise the code in case of constant parameter, as shown by the small exemple below. int ffs_test(void) { return 4 << ffs(31); } c0012334 <ffs_test>: c0012334: 39 20 00 01 li r9,1 c0012338: 38 60 00 04 li r3,4 c001233c: 7d 29 00 34 cntlzw r9,r9 c0012340: 21 29 00 20 subfic r9,r9,32 c0012344: 7c 63 48 30 slw r3,r3,r9 c0012348: 4e 80 00 20 blr With this patch, the same function will compile as follows: c0012334 <ffs_test>: c0012334: 38 60 00 08 li r3,8 c0012338: 4e 80 00 20 blr The same happens with __ffs() For non constant calls, the generated code is doing the same, allthought it is slightly different on 64 bits for ffs(): unsigned long test__ffs(unsigned long x) { return __ffs(x); } int testffs(int x) { return ffs(x); } On PPC32, before the patch: 0000003c <test__ffs>: 3c: 7d 23 00 d0 neg r9,r3 40: 7d 23 18 38 and r3,r9,r3 44: 7c 63 00 34 cntlzw r3,r3 48: 20 63 00 1f subfic r3,r3,31 4c: 4e 80 00 20 blr 00000050 <testffs>: 50: 7d 23 00 d0 neg r9,r3 54: 7d 23 18 38 and r3,r9,r3 58: 7c 63 00 34 cntlzw r3,r3 5c: 20 63 00 20 subfic r3,r3,32 60: 4e 80 00 20 blr On PPC32, after the patch: 0000002c <test__ffs>: 2c: 7d 23 00 d0 neg r9,r3 30: 7d 23 18 38 and r3,r9,r3 34: 7c 63 00 34 cntlzw r3,r3 38: 20 63 00 1f subfic r3,r3,31 3c: 4e 80 00 20 blr 00000040 <testffs>: 40: 7d 23 00 d0 neg r9,r3 44: 7d 23 18 38 and r3,r9,r3 48: 7c 63 00 34 cntlzw r3,r3 4c: 20 63 00 20 subfic r3,r3,32 50: 4e 80 00 20 blr On PPC64, before the patch: 0000000000000060 <.test__ffs>: 60: 7c 03 00 d0 neg r0,r3 64: 7c 03 18 38 and r3,r0,r3 68: 7c 63 00 74 cntlzd r3,r3 6c: 20 63 00 3f subfic r3,r3,63 70: 7c 63 07 b4 extsw r3,r3 74: 4e 80 00 20 blr 0000000000000080 <.testffs>: 80: 7c 03 00 d0 neg r0,r3 84: 7c 03 18 38 and r3,r0,r3 88: 7c 63 00 74 cntlzd r3,r3 8c: 20 63 00 40 subfic r3,r3,64 90: 7c 63 07 b4 extsw r3,r3 94: 4e 80 00 20 blr On PPC64, after the patch: 0000000000000050 <.test__ffs>: 50: 7c 03 00 d0 neg r0,r3 54: 7c 03 18 38 and r3,r0,r3 58: 7c 63 00 74 cntlzd r3,r3 5c: 20 63 00 3f subfic r3,r3,63 60: 4e 80 00 20 blr 0000000000000070 <.testffs>: 70: 7c 03 00 d0 neg r0,r3 74: 7c 03 18 38 and r3,r0,r3 78: 7c 63 00 34 cntlzw r3,r3 7c: 20 63 00 20 subfic r3,r3,32 80: 7c 63 07 b4 extsw r3,r3 84: 4e 80 00 20 blr (ffs() operates on an int so cntlzw is equivalent to cntlzd) In addition, when reading the generated vmlinux, we can observe that with the builtin functions, GCC sometimes efficiently spreads the instructions within the generated functions while the inline assembly force them to remain grouped together. __builtin_ffs() is already used in arch/powerpc/include/asm/page_32.h Those builtins have been in GCC since at least 3.4.6 (see https://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Other-Builtins.html ) Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
It often happens to have simultaneous interrupts, for instance when having double Ethernet attachment. With the current implementation, we suffer the cost of kernel entry/exit for each interrupt. This patch introduces a loop in __do_irq() to handle all interrupts at once before returning. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
IRQ 0 is a valid HW interrupt. So get_irq() shall return 0 when there is no irq, instead of returning irq_linear_revmap(... ,0) Fixes: f2a0bd37 ("[POWERPC] 8xx: powerpc port of core CPM PIC") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Christophe Leroy authored
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-