1. 14 May, 2024 6 commits
    • LoongArch: Fix callchain parse error with kernel tracepoint events again · d6af2c76
      Huacai Chen authored
      With commit d3119bc9 ("LoongArch: Fix callchain parse error with
      kernel tracepoint events"), perf can parse the kernel callchain, but
      the result is incomplete and sometimes wrong. The reason is that
      LoongArch's unwinders (guess, prologue and orc) don't really need fp
      (i.e., regs[22]), and they use sp (i.e., regs[3]) as the frame address
      rather than the current stack pointer.
      
      Fix that by removing the assignment of regs[22] and instead assigning
      __builtin_frame_address(0) to regs[3].
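
      As a rough user-space illustration of the change described above (not
      the kernel code itself), the sketch below stores the caller-supplied ip
      and the current frame address in a register snapshot. The names
      demo_pt_regs and demo_fetch_caller_regs are hypothetical stand-ins; only
      __builtin_frame_address(0), regs[3] and regs[22] come from the commit
      message.

```c
/* Hypothetical stand-in for the kernel's register snapshot (illustration only). */
struct demo_pt_regs {
	unsigned long regs[32];	/* regs[3] is sp, regs[22] is fp */
	unsigned long csr_era;	/* assumed field holding the ip */
};

/*
 * Written as a macro so that __builtin_frame_address(0) is evaluated in the
 * frame of the function that uses it. regs[22] (fp) is deliberately left
 * untouched, because the unwinders do not need it.
 */
#define demo_fetch_caller_regs(r, ip) do {				\
	(r)->csr_era = (ip);						\
	(r)->regs[3] = (unsigned long)__builtin_frame_address(0);	\
} while (0)
```

      A caller would declare a struct demo_pt_regs, zero it, and invoke
      demo_fetch_caller_regs(&r, ip) at the point where the callchain should
      start.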
      
      Without fix:
      
        Children      Self  Command        Shared Object      Symbol
        ........  ........  .............  .................  ................
        33.91%    33.91%    swapper        [kernel.vmlinux]   [k] __schedule
                  |
                  |--33.04%--__schedule
                  |
                   --0.87%--__arch_cpu_idle
                             __schedule
      
      With this fix:
      
        Children      Self  Command        Shared Object      Symbol
        ........  ........  .............  .................  ................
        31.16%    31.16%    swapper        [kernel.vmlinux]   [k] __schedule
                  |
                  |--20.63%--smpboot_entry
                  |          cpu_startup_entry
                  |          schedule_idle
                  |          __schedule
                  |
                   --10.53%--start_kernel
                             cpu_startup_entry
                             schedule_idle
                             __schedule
      
      Fixes: d3119bc9 ("LoongArch: Fix callchain parse error with kernel tracepoint events")
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Give a chance to build with !CONFIG_SMP · 5685d7fc
      Tiezhu Yang authored
      In the current code, SMP is selected in Kconfig for LoongArch and users
      cannot unset it. This is reasonable for a multi-processor machine, but
      as the help text of config SMP says: if you have a system with only one
      CPU, say N. On a uni-processor machine, the kernel will run faster if
      you say N here.
      
      Loongson-2K0500 is a single-core CPU for applications such as industrial
      control, printing terminals, and BMC (Baseboard Management Controller).
      There are many development boards, products and solutions on the market,
      so it is both useful and necessary to allow building with !CONFIG_SMP
      for a uni-processor machine.
      
      First of all, do not select SMP for config LOONGARCH in Kconfig, so that
      CONFIG_SMP can be unset. Then, make some changes to fix the warnings
      and errors that appear when CONFIG_SMP is not set.
      
      (1) Define get_ipi_irq() only if CONFIG_SMP is set to fix the warning:
      arch/loongarch/kernel/irq.c:90:19: warning: 'get_ipi_irq' defined but not used [-Wunused-function]
      
      (2) Add "#ifdef CONFIG_SMP" in asm/smp.h to fix the warning:
      ./arch/loongarch/include/asm/smp.h:49:9: warning: "raw_smp_processor_id" redefined
         49 | #define raw_smp_processor_id raw_smp_processor_id
            |         ^~~~~~~~~~~~~~~~~~~~
      ./include/linux/smp.h:198:9: note: this is the location of the previous definition
        198 | #define raw_smp_processor_id()                  0
      
      (3) Define machine_shutdown() as empty under !CONFIG_SMP to fix the error:
      arch/loongarch/kernel/machine_kexec.c: In function 'machine_shutdown':
      arch/loongarch/kernel/machine_kexec.c:233:25: error: implicit declaration of function 'cpu_device_up'; did you mean 'put_device'? [-Wimplicit-function-declaration]
      
      (4) Make config SCHED_SMT depend on SMP to fix many errors such as:
      kernel/sched/core.c: In function 'sched_core_find':
      kernel/sched/core.c:310:43: error: 'struct rq' has no member named 'cpu'
      
      (5) Define cpu_logical_map(cpu) as 0 under !CONFIG_SMP in asm/smp.h,
      then include asm/smp.h in asm/acpi.h (because acpi.h is included in
      linux/irq.h indirectly) to fix many build errors under drivers/irqchip
      such as:
      drivers/irqchip/irq-loongson-eiointc.c: In function 'cpu_to_eio_node':
      drivers/irqchip/irq-loongson-eiointc.c:59:16: error: implicit declaration of function 'cpu_logical_map' [-Wimplicit-function-declaration]
      
      (6) Do not write per_cpu_offset(0) to PERCPU_BASE_KS on resume, because
      the per_cpu_offset(x) macro is defined as (__per_cpu_offset[x]) only
      under CONFIG_SMP in include/asm-generic/percpu.h. Instead, save the value
      of PERCPU_BASE_KS on suspend and restore it on resume to fix the error:
      arch/loongarch/power/suspend.c: In function 'loongarch_common_resume':
      arch/loongarch/power/suspend.c:47:21: error: implicit declaration of function 'per_cpu_offset' [-Wimplicit-function-declaration]
      
      (7) Fix huge page handling under !CONFIG_SMP in tlbex.S.
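
      The stub pattern used in items such as (5) can be sketched as follows.
      This is a minimal illustration under stated assumptions, not the patch
      itself: the demo_* names and DEMO_CORES_PER_NODE are hypothetical, and
      CONFIG_SMP would normally come from Kconfig rather than from the
      compiler command line.

```c
/*
 * Sketch only. With CONFIG_SMP undefined, the uni-processor branch is
 * compiled and cpu_logical_map() collapses to the constant 0.
 */
#ifdef CONFIG_SMP
static int demo_cpu_logical_map[4] = { 0, 1, 2, 3 };	/* hypothetical table */
#define cpu_logical_map(cpu)	demo_cpu_logical_map[cpu]
#else
/* Uni-processor: the only CPU is always logical CPU 0. */
#define cpu_logical_map(cpu)	0
#endif

#define DEMO_CORES_PER_NODE	4	/* assumed value, for illustration */

/* Callers compile unchanged in both configurations. */
static int demo_cpu_to_node(int cpu)
{
	return cpu_logical_map(cpu) / DEMO_CORES_PER_NODE;
}
```

      Keeping the macro defined in both branches means that driver code
      calling cpu_logical_map() needs no #ifdef of its own.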
      
      When running the UnixBench tests with a "-c 1" single-stream pass,
      performance improves by about 9 percent with this patch.
      
      In addition, building with !CONFIG_SMP is helpful for debugging and
      analyzing kernel issues seen on multi-processor systems.
      Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Select THP_SWAP if HAVE_ARCH_TRANSPARENT_HUGEPAGE · ff4a2443
      Huacai Chen authored
      THP_SWAP has been shown to improve swap throughput significantly on
      x86_64 systems according to commit bd4c82c2 ("mm, THP, swap: delay
      splitting THP after swapped out"), on ARM64 systems according to commit
      d0637c50 ("arm64: enable THP_SWAP for arm64") and on RISC-V systems
      according to commit 87f81e66 ("riscv: enable THP_SWAP for RV64").
      
      Enable THP_SWAP for LoongArch. Running the micro-benchmark introduced
      by commit d0637c50 ("arm64: enable THP_SWAP for arm64") gives the
      following numbers on the Loongson-3A5000 board:
      
      swp out bandwidth w/o patch: 1815716 bytes/ms (mean of 10 tests)
      swp out bandwidth w/  patch: 3410003 bytes/ms (mean of 10 tests)
      
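      Improved by about 87.8% over the unpatched baseline (equivalently, the
      unpatched bandwidth is 46.75% lower than the patched one)!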
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Select ARCH_WANT_DEFAULT_BPF_JIT · d0b35b02
      Huacai Chen authored
      The BPF JIT performs better and is more secure than the BPF interpreter,
      so enable it by default, as most other architectures do.
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 · 5125d033
      Xi Ruoyao authored
      This allows compiling a full 128-bit product of two 64-bit integers as a
      mul/mulh pair, instead of a nasty long sequence of 20+ instructions.
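
      For illustration, the kind of code that benefits looks roughly like the
      sketch below. It assumes a 64-bit target and a GCC or Clang compiler
      that provides unsigned __int128; the demo_* function names are
      hypothetical and are not the kernel's helpers.

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit product; returns the high 64 bits. */
static uint64_t demo_mul_u64_u64_hi(uint64_t a, uint64_t b)
{
	return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/*
 * (a * mul) >> shift computed with a 128-bit intermediate, so the product
 * cannot overflow before the shift. shift is assumed to be below 128.
 */
static uint64_t demo_mul_u64_u32_shr(uint64_t a, uint32_t mul,
				     unsigned int shift)
{
	return (uint64_t)(((unsigned __int128)a * mul) >> shift);
}
```

      With 128-bit support the compiler can lower each product to a
      multiply plus multiply-high pair, as the message above describes,
      rather than emulating it with 64-bit pieces.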
      
      However, after selecting ARCH_SUPPORTS_INT128, when optimizing for size
      the compiler generates calls to __ashlti3, __ashrti3, and __lshrti3 for
      shifting __int128 values, causing a link failure:
      
          loongarch64-unknown-linux-gnu-ld: kernel/sched/fair.o: in
          function `mul_u64_u32_shr':
          <PATH>/include/linux/math64.h:161:(.text+0x5e4): undefined
          reference to `__lshrti3'
      
      So provide the implementation of these functions if ARCH_SUPPORTS_INT128.
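
      To show what such a helper has to compute, here is a minimal sketch of
      a 128-bit logical right shift built from two 64-bit halves. It is not
      the implementation added by this commit; the struct and function names
      are hypothetical, and the shift count is assumed to be in the range
      0..127.

```c
#include <stdint.h>

/* Hypothetical representation of a 128-bit value as two 64-bit halves. */
struct demo_u128 {
	uint64_t lo;
	uint64_t hi;
};

/* Logical right shift of a 128-bit value by n bits (0 <= n <= 127). */
static struct demo_u128 demo_lshr128(struct demo_u128 v, unsigned int n)
{
	struct demo_u128 r;

	if (n == 0)
		return v;	/* avoids an undefined 64-bit shift by 64 below */

	if (n >= 64) {
		/* Everything comes from the high half; the new high half is zero. */
		r.lo = v.hi >> (n - 64);
		r.hi = 0;
	} else {
		/* Bits shifted out of the high half move into the low half. */
		r.lo = (v.lo >> n) | (v.hi << (64 - n));
		r.hi = v.hi >> n;
	}
	return r;
}
```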
      
      Closes: https://lore.kernel.org/loongarch/CAAhV-H5EZ=7OF7CSiYyZ8_+wWuenpo=K2WT8-6mAT4CvzUC_4g@mail.gmail.com/
      Signed-off-by: Xi Ruoyao <xry111@xry111.site>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
    • LoongArch: Select ARCH_HAS_FAST_MULTIPLIER · 2cce9059
      Xi Ruoyao authored
      LA464 and LA664 can do 32-bit/64-bit integer multiplication with a
      latency of 4 cycles and a throughput of 2 ops per cycle.  It is
      comparable to the mainstream x86 and arm64 cores, so we can select
      ARCH_HAS_FAST_MULTIPLIER like them.
      
      It speeds up __sw_hweight32() in lib/hweight.c by about 14% on LA464
      and 11% on LA664, and __sw_hweight64() by about 30% on LA464 and 33%
      on LA664.
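
      The difference between the two code paths can be sketched in user space
      as below. This is an illustration of the general population-count
      technique, assuming the usual multiply-based and shift-and-add variants;
      the demo_* names are hypothetical and the code is not copied from
      lib/hweight.c.

```c
#include <stdint.h>

/* Shared first steps: reduce w to four per-byte bit counts. */
static uint32_t demo_byte_counts(uint32_t w)
{
	w -= (w >> 1) & 0x55555555u;
	w = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
	return (w + (w >> 4)) & 0x0f0f0f0fu;
}

/* Fast-multiplier variant: one multiply sums the four byte counts. */
static unsigned int demo_hweight32_mul(uint32_t w)
{
	return (demo_byte_counts(w) * 0x01010101u) >> 24;
}

/* Fallback variant: sum the byte counts with shifts and adds. */
static unsigned int demo_hweight32_shift(uint32_t w)
{
	uint32_t res = demo_byte_counts(w);

	res += res >> 8;
	return (res + (res >> 16)) & 0xffu;
}
```

      Both functions return the same count; the multiply variant trades two
      shift-and-add steps for a single multiplication, which pays off only
      when the multiplier is fast.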
      Signed-off-by: Xi Ruoyao <xry111@xry111.site>
      Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
  2. 12 May, 2024 5 commits
  3. 11 May, 2024 10 commits
  4. 10 May, 2024 19 commits