1. 01 Oct, 2022 1 commit
  2. 26 Sep, 2022 6 commits
    • cpumask: add cpumask_nth_{,and,andnot} · 944c417d
      Yury Norov authored
      Add cpumask_nth_{,and,andnot} as wrappers around corresponding
      find functions, and use it in cpumask_local_spread().
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
    • lib/bitmap: remove bitmap_ord_to_pos · 97848c10
      Yury Norov authored
      Now that we have find_nth_bit(), we can drop bitmap_ord_to_pos().
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
    • lib/bitmap: add tests for find_nth_bit() · e3783c80
      Yury Norov authored
      Add functional and performance tests for find_nth_bit().
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
    • lib: add find_nth{,_and,_andnot}_bit() · 3cea8d47
      Yury Norov authored
      The kernel lacks a function that searches for the Nth set bit in a
      bitmap. Usually people do it like this:
      	for_each_set_bit(bit, mask, size)
      		if (n-- == 0)
      			return bit;
      
      We can do it more efficiently, if we:
      1. find a word containing Nth bit, using hweight(); and
      2. find the bit, using a helper fns(), that works similarly to
         __ffs() and ffz().
      
      fns() is implemented as a simple loop. For x86_64, there's the PDEP
      instruction to do that: ret = ctz(pdep(1 << idx, num)). However, for large
      bitmaps most of the improvement comes from using hweight(), so I kept
      fns() simple.
      
      New find_nth_bit() is ~70 times faster on x86_64/kvm in find_bit benchmark:
      find_nth_bit:                  7154190 ns,  16411 iterations
      for_each_bit:                505493126 ns,  16315 iterations
      
      With all that, a family of 3 new functions is added, and used where
      appropriate in the following patches.
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
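
The two-step search described in the find_nth_bit() commit message can be illustrated with a minimal userspace sketch. This is not the kernel implementation: the names fns_sketch() and find_nth_bit_sketch() are hypothetical, and the GCC/Clang builtins __builtin_popcountl() and __builtin_ctzl() stand in for hweight() and __ffs().

```c
#include <limits.h>

#define BITS_PER_LONG ((unsigned long)(sizeof(unsigned long) * CHAR_BIT))

/* Index of the n-th (0-based) set bit in word, or BITS_PER_LONG if absent. */
static unsigned long fns_sketch(unsigned long word, unsigned int n)
{
	while (word) {
		unsigned long bit = (unsigned long)__builtin_ctzl(word);

		if (n-- == 0)
			return bit;
		word &= word - 1;	/* clear the lowest set bit */
	}
	return BITS_PER_LONG;
}

/* Position of the n-th (0-based) set bit, or size if there is none. */
static unsigned long find_nth_bit_sketch(const unsigned long *addr,
					 unsigned long size, unsigned long n)
{
	unsigned long nwords = (size + BITS_PER_LONG - 1) / BITS_PER_LONG;

	for (unsigned long idx = 0; idx < nwords; idx++) {
		unsigned long word = addr[idx];

		/* Ignore bits beyond size in a partial last word. */
		if (idx == nwords - 1 && size % BITS_PER_LONG)
			word &= (1UL << (size % BITS_PER_LONG)) - 1;

		/* Step 1: skip whole words using their population count. */
		unsigned long w = (unsigned long)__builtin_popcountl(word);

		if (n < w)	/* Step 2: the bit lives in this word. */
			return idx * BITS_PER_LONG + fns_sketch(word, (unsigned int)n);
		n -= w;
	}
	return size;
}
```

Skipping by word weight is where most of the gain comes from on large bitmaps, which matches the message's reasoning for keeping fns() a plain loop.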
    • lib/bitmap: add bitmap_weight_and() · 24291caf
      Yury Norov authored
      The function calculates the Hamming weight of (bitmap1 & bitmap2).
      Currently we have to do it like this:
      	tmp = bitmap_alloc(nbits);
      	bitmap_and(tmp, map1, map2, nbits);
      	weight = bitmap_weight(tmp, nbits);
      	bitmap_free(tmp);
      
      This requires additional memory, adds pressure on the alloc subsystem,
      and is way less cache-friendly than just:
      	weight = bitmap_weight_and(map1, map2, nbits);
      
      The following patches apply it for cpumask functions.
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
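
As a rough illustration of why no temporary bitmap is needed, the weight of the intersection can be accumulated word by word. The function name bitmap_weight_and_sketch() is hypothetical and the code is a userspace approximation, not the kernel's implementation; __builtin_popcountl() is a GCC/Clang builtin.

```c
#include <limits.h>

#define BITS_PER_LONG ((unsigned long)(sizeof(unsigned long) * CHAR_BIT))

/* Count the bits set in (map1 & map2) over the first nbits bits. */
static unsigned long bitmap_weight_and_sketch(const unsigned long *map1,
					      const unsigned long *map2,
					      unsigned long nbits)
{
	unsigned long weight = 0, idx;

	/* Full words: AND and count in one pass, no temporary storage. */
	for (idx = 0; idx < nbits / BITS_PER_LONG; idx++)
		weight += (unsigned long)__builtin_popcountl(map1[idx] & map2[idx]);

	/* Partial last word: mask off the bits beyond nbits. */
	if (nbits % BITS_PER_LONG) {
		unsigned long mask = (1UL << (nbits % BITS_PER_LONG)) - 1;

		weight += (unsigned long)__builtin_popcountl(map1[idx] & map2[idx] & mask);
	}
	return weight;
}
```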
    • lib/bitmap: don't call __bitmap_weight() in kernel code · 70a1cb10
      Yury Norov authored
      __bitmap_weight() is not to be used directly in the kernel code because
      it's a helper for bitmap_weight(). Switch everything to bitmap_weight().
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
  3. 21 Sep, 2022 4 commits
    • tools: sync find_bit() implementation · 6333cb31
      Yury Norov authored
      Sync find_first_bit() and find_next_bit() implementation with the
      mother kernel.
      
      Also, drop unused find_last_bit() and find_next_clump8().
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
    • lib/find_bit: optimize find_next_bit() functions · e79864f3
      Yury Norov authored
      Over the past couple of years, the function _find_next_bit() was extended
      with parameters that modify its behavior to implement the and-, zero- and
      le- flavors. The parameters are passed at compile time, but the current
      design prevents a compiler from optimizing out the conditionals.
      
      As find_next_bit() API grows, I expect that more parameters will be added.
      Current design would require more conditional code in _find_next_bit(),
      which would bloat the helper even more and make it barely readable.
      
      This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
      a set of wrappers, so that the compile-time optimizations become possible.
      
      The common logic is moved to the new macro, and all flavors may be
      generated by providing a FETCH macro parameter, like in this example:
      
        #define FIND_NEXT_BIT(FETCH, MUNGE, size, start) ...
      
        find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
        {
      	return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx],
      				/* nop */, size, start);
        }
      
      The FETCH may be of any complexity, as long as it only refers to the
      bitmap(s) and an iterator idx.
      
      MUNGE is here to support _le code generation for BE builds. May be
      empty.
      
      I ran find_bit_benchmark 16 times on top of 6.0-rc2 and 16 times on top
      of 6.0-rc2 + this series. The results for kvm/x86_64 are:
      
                            v6.0-rc2  Optimized       Difference  Z-score
      Random dense bitmap         ns         ns        ns      %
      find_next_bit:          787735     670546    117189   14.9     3.97
      find_next_zero_bit:     777492     664208    113284   14.6    10.51
      find_last_bit:          830925     687573    143352   17.3     2.35
      find_first_bit:        3874366    3306635    567731   14.7     1.84
      find_first_and_bit:   40677125   37739887   2937238    7.2     1.36
      find_next_and_bit:      347865     304456     43409   12.5     1.35
      
      Random sparse bitmap
      find_next_bit:           19816      14021      5795   29.2     6.10
      find_next_zero_bit:    1318901    1223794     95107    7.2     1.41
      find_last_bit:           14573      13514      1059    7.3     6.92
      find_first_bit:        1313321    1249024     64297    4.9     1.53
      find_first_and_bit:       8921       8098       823    9.2     4.56
      find_next_and_bit:        9796       7176      2620   26.7     5.39
      
      Where the statistics are significant (z-score > 3), the improvement
      is ~15%.
      
      According to the bloat-o-meter, the Image size is 10-11K less:
      
      x86_64/defconfig:
      add/remove: 32/14 grow/shrink: 61/782 up/down: 6344/-16521 (-10177)
      
      arm64/defconfig:
      add/remove: 3/2 grow/shrink: 50/714 up/down: 608/-11556 (-10948)
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
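
The macro-plus-wrappers idea from the find_next_bit() commit can be sketched in userspace roughly as follows. This is an approximation under stated assumptions, not the kernel macro: FIND_NEXT_BIT_SKETCH and the *_sketch wrappers are hypothetical names, the code relies on GNU C statement expressions (GCC/Clang), and __builtin_ctzl() stands in for __ffs().

```c
#include <limits.h>

#define BITS_PER_LONG ((unsigned long)(sizeof(unsigned long) * CHAR_BIT))

/*
 * FETCH is an expression over the iterator 'idx'; MUNGE may be empty.
 * Evaluates to the position of the first set bit at or after 'start',
 * or to 'size' if there is none.
 */
#define FIND_NEXT_BIT_SKETCH(FETCH, MUNGE, size, start)			\
({									\
	unsigned long sz = (size), from = (start), idx, tmp, res = sz;	\
									\
	if (from < sz) {						\
		idx = from / BITS_PER_LONG;				\
		/* Drop bits below 'start' in the first word. */	\
		tmp = (FETCH) & MUNGE(~0UL << (from % BITS_PER_LONG));	\
		while (!tmp && (idx + 1) * BITS_PER_LONG < sz) {	\
			idx++;						\
			tmp = (FETCH);					\
		}							\
		if (tmp) {						\
			res = idx * BITS_PER_LONG +			\
			      (unsigned long)__builtin_ctzl(MUNGE(tmp)); \
			if (res > sz)					\
				res = sz;				\
		}							\
	}								\
	res;								\
})

static unsigned long find_next_bit_sketch(const unsigned long *addr,
					  unsigned long size, unsigned long start)
{
	return FIND_NEXT_BIT_SKETCH(addr[idx], /* nop */, size, start);
}

static unsigned long find_next_zero_bit_sketch(const unsigned long *addr,
					       unsigned long size, unsigned long start)
{
	return FIND_NEXT_BIT_SKETCH(~addr[idx], /* nop */, size, start);
}

static unsigned long find_next_and_bit_sketch(const unsigned long *addr1,
					      const unsigned long *addr2,
					      unsigned long size, unsigned long start)
{
	return FIND_NEXT_BIT_SKETCH(addr1[idx] & addr2[idx], /* nop */, size, start);
}
```

Because each wrapper expands the macro with its own FETCH expression, the compiler sees straight-line code per flavor and has no runtime flags to test.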
    • lib/find_bit: create find_first_zero_bit_le() · 14a99e13
      Yury Norov authored
      find_first_zero_bit_le() is an alias to find_next_zero_bit_le(),
      even though the 'next' version is known to be slower than the 'first'
      version.
      
      Now that we have the common FIND_FIRST_BIT() macro helper, it's trivial
      to implement find_first_zero_bit_le() as a real function.
      Reviewed-by: Valentin Schneider <vschneid@redhat.com>
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
    • lib/find_bit: introduce FIND_FIRST_BIT() macro · 58414bbb
      Yury Norov authored
      Now that we have many flavors of find_first_bit(), and expect even more,
      it's better to have one macro that generates optimal code for all of them
      and makes maintaining slightly different functions simpler.
      
      The logic common to all versions is moved to the new macro, and all the
      flavors are generated by providing a FETCH macro parameter, like
      in this example:
      
        #define FIND_FIRST_BIT(FETCH, MUNGE, size) ...
      
        find_first_ornot_and_bit(addr1, addr2, addr3, size)
        {
              return FIND_FIRST_BIT(addr1[idx] | ~addr2[idx] & addr3[idx], /* nop */, size);
        }
      
      The FETCH may be of any complexity, as long as it only refers to
      the bitmap(s) and an iterator idx.
      
      MUNGE is here to support _le code generation for BE builds. May be
      empty.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Valentin Schneider <vschneid@redhat.com>
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
  4. 20 Sep, 2022 6 commits
    • lib/cpumask: add FORCE_NR_CPUS config option · 6f9c07be
      Yury Norov authored
      The size of cpumasks is hard-limited by the compile-time parameter
      NR_CPUS, but the actual size is determined at boot time, when the kernel
      parses ACPI/DT tables, and is stored in nr_cpu_ids. In many practical
      cases, the number of CPUs for a target is known at compile time, and can
      be provided with NR_CPUS.
      
      In that case, the compiler may be instructed to treat NR_CPUS as the
      actual number of CPUs, not an upper limit. This allows optimizing many
      cpumask routines and significantly shrinks the size of the kernel image.
      
      This patch adds FORCE_NR_CPUS option to teach the compiler to rely on
      NR_CPUS and enable corresponding optimizations.
      
      If FORCE_NR_CPUS=y, kernel will not set nr_cpu_ids at boot, but only check
      that the actual number of possible CPUs is equal to NR_CPUS, and WARN if
      that doesn't hold.
      
      The new option is especially useful in embedded applications because
      kernel configurations are unique for each SoC, the number of CPUs is
      constant and known well, and memory limitations are typically harder.
      
      For my 4-CPU ARM64 build with NR_CPUS=4, FORCE_NR_CPUS=y saves 46KB:
        add/remove: 3/4 grow/shrink: 46/729 up/down: 652/-46952 (-46300)
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
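
The behavior described for FORCE_NR_CPUS=y (a compile-time constant instead of a boot-time variable, plus a check at boot) might look roughly like the userspace sketch below. The macro FORCE_NR_CPUS here is a hypothetical stand-in for the config option, the return value replaces the kernel's WARN, and the exact kernel definitions may differ.

```c
#include <stdio.h>

#define NR_CPUS 4
#define FORCE_NR_CPUS 1	/* hypothetical stand-in for CONFIG_FORCE_NR_CPUS=y */

#if FORCE_NR_CPUS
/* A constant: loops bounded by nr_cpu_ids can be optimized at compile time. */
#define nr_cpu_ids ((unsigned int)NR_CPUS)
#else
/* A variable: set at boot, so the bound is only known at run time. */
static unsigned int nr_cpu_ids = NR_CPUS;
#endif

/* Returns 0 on success, -1 if the detected count contradicts NR_CPUS. */
static int set_nr_cpu_ids(unsigned int nr)
{
#if FORCE_NR_CPUS
	if (nr != nr_cpu_ids) {
		/* The kernel would WARN here instead of printing. */
		fprintf(stderr, "detected %u CPUs, built for %u\n", nr, nr_cpu_ids);
		return -1;
	}
	return 0;
#else
	nr_cpu_ids = nr;
	return 0;
#endif
}
```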
    • powerpc/64: don't refer nr_cpu_ids in asm code when it's undefined · 546a073d
      Yury Norov authored
      generic_secondary_common_init() calls LOAD_REG_ADDR(r7, nr_cpu_ids)
      conditionally on CONFIG_SMP. However, if 'NR_CPUS == 1', kernel doesn't
      use the nr_cpu_ids, and in C code, it's just:
        #if NR_CPUS == 1
        #define nr_cpu_ids
        ...
      
      This series makes the declaration of nr_cpu_ids conditional on
      NR_CPUS == 1, and that reveals the issue, because
      LOAD_REG_ADDR(r7, nr_cpu_ids) can't be linked against a nonexistent
      symbol.
      
      The current code looks unsafe for those who build the kernel with
      CONFIG_SMP=y and NR_CPUS == 1. This is a weird configuration, but it is
      not disallowed.
      
      Fix the linker error by replacing LOAD_REG_ADDR() with LOAD_REG_IMMEDIATE()
      conditionally on NR_CPUS == 1.
      
      As the following patch adds CONFIG_FORCE_NR_CPUS option that has the
      similar effect on nr_cpu_ids, make the generic_secondary_common_init()
      conditional on it too.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
    • lib/cpumask: deprecate nr_cpumask_bits · aa47a7c2
      Yury Norov authored
      Cpumask code is written on the assumption that when
      CONFIG_CPUMASK_OFFSTACK is enabled, all cpumasks have a boot-time defined
      size, and otherwise the size is always NR_CPUS.
      
      The latter is wrong because the number of possible cpus is always
      calculated on boot, and it may be less than NR_CPUS.
      
      On my 4-cpu arm64 VM the nr_cpu_ids is 4, as expected, and nr_cpumask_bits
      is 256, which corresponds to NR_CPUS. This not only leads to useless
      traversing of cpumask bits greater than 4, this also makes some cpumask
      routines fail.
      
      For example, cpumask_full(0b1111000..000) would erroneously return false
      in the example above because tail bits in the mask are all unset.
      
      This patch deprecates nr_cpumask_bits and wires it to nr_cpu_ids
      unconditionally, so that cpumask routines will not waste time traversing
      unused part of cpu masks. It also fixes cpumask_full() and similar
      routines.
      
      As a side effect, because now a length of cpumasks is defined at run-time
      even if CPUMASK_OFFSTACK is disabled, compiler can't optimize corresponding
      functions.
      
      It increases kernel size by ~2.5KB if OFFSTACK is off. This is addressed in
      the following patch.
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
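
The cpumask_full() failure mentioned in the nr_cpumask_bits commit can be reproduced with a small userspace sketch. The function bitmap_full_sketch() is a hypothetical stand-in for the kernel's fullness check, and the sizes 256 and 4 are taken from the example in the message.

```c
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG ((unsigned long)(sizeof(unsigned long) * CHAR_BIT))

/* True if all of the first nbits bits of map are set. */
static bool bitmap_full_sketch(const unsigned long *map, unsigned long nbits)
{
	unsigned long idx;

	for (idx = 0; idx < nbits / BITS_PER_LONG; idx++)
		if (~map[idx])
			return false;

	if (nbits % BITS_PER_LONG) {
		unsigned long mask = (1UL << (nbits % BITS_PER_LONG)) - 1;

		if (~map[idx] & mask)
			return false;
	}
	return true;
}
```

With four CPUs set in a 256-bit mask, checking 256 bits (the old nr_cpumask_bits) reports "not full", while checking 4 bits (nr_cpu_ids) reports "full".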
    • lib/cpumask: delete misleading comment · 7102b3bb
      Yury Norov authored
      The comment says that the HOTPLUG config option enables all cpus in
      cpu_possible_mask up to NR_CPUS. This is wrong. Even if HOTPLUG is
      enabled, the mask is populated on boot with respect to ACPI/DT records.
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
    • smp: add set_nr_cpu_ids() · 38bef8e5
      Yury Norov authored
      In preparation to support compile-time nr_cpu_ids, add a setter for
      the variable.
      
      This is a no-op for all arches.
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
    • smp: don't declare nr_cpu_ids if NR_CPUS == 1 · 53fc190c
      Yury Norov authored
      SMP and NR_CPUS are independent options, hence nr_cpu_ids may be
      declared even if NR_CPUS == 1, which is useless.
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
  5. 09 Sep, 2022 1 commit
    • drivers/base: Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES · b9be19ee
      Phil Auld authored
      As PAGE_SIZE is unsigned long, -1 > PAGE_SIZE when NR_CPUS <= 3.
      This leads to very large file sizes:
      
      topology$ ls -l
      total 0
      -r--r--r-- 1 root root 18446744073709551615 Sep  5 11:59 core_cpus
      -r--r--r-- 1 root root                 4096 Sep  5 11:59 core_cpus_list
      -r--r--r-- 1 root root                 4096 Sep  5 10:58 core_id
      -r--r--r-- 1 root root 18446744073709551615 Sep  5 10:10 core_siblings
      -r--r--r-- 1 root root                 4096 Sep  5 11:59 core_siblings_list
      -r--r--r-- 1 root root 18446744073709551615 Sep  5 11:59 die_cpus
      -r--r--r-- 1 root root                 4096 Sep  5 11:59 die_cpus_list
      -r--r--r-- 1 root root                 4096 Sep  5 11:59 die_id
      -r--r--r-- 1 root root 18446744073709551615 Sep  5 11:59 package_cpus
      -r--r--r-- 1 root root                 4096 Sep  5 11:59 package_cpus_list
      -r--r--r-- 1 root root                 4096 Sep  5 10:58 physical_package_id
      -r--r--r-- 1 root root 18446744073709551615 Sep  5 10:10 thread_siblings
      -r--r--r-- 1 root root                 4096 Sep  5 11:59 thread_siblings_list
      
      Adjust the inequality to catch the case when NR_CPUS is configured
      to a small value.
      
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Yury Norov <yury.norov@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: feng xiangjun <fengxj325@gmail.com>
      Fixes: 7ee951ac ("drivers/base: fix userspace break from using bin_attributes for cpumap and cpulist")
      Reported-by: feng xiangjun <fengxj325@gmail.com>
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
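
The signed/unsigned pitfall behind the CPUMAP_FILE_MAX_BYTES fix can be shown with a small sketch. The expression (NR_CPUS * 9) / 32 - 1 and both function names are assumptions made for illustration; the real macro and the exact form of the fix may differ. What the sketch relies on is standard C: when an int is compared with an unsigned long, the int is converted to unsigned long first, so -1 becomes ULONG_MAX.

```c
#include <limits.h>

/* Buggy shape: for small nr_cpus the left side is -1, which converts to
 * ULONG_MAX in the comparison and therefore "wins" against page_size. */
static unsigned long max_bytes_buggy(int nr_cpus, unsigned long page_size)
{
	return ((nr_cpus * 9) / 32 - 1) > page_size
		? (unsigned long)((nr_cpus * 9) / 32 - 1)
		: page_size;
}

/* One possible rearrangement: compare before subtracting, so the compared
 * value is never negative. */
static unsigned long max_bytes_fixed(int nr_cpus, unsigned long page_size)
{
	return (unsigned long)((nr_cpus * 9) / 32) > page_size
		? (unsigned long)((nr_cpus * 9) / 32 - 1)
		: page_size;
}
```

On a 64-bit system ULONG_MAX is 18446744073709551615, the file size shown in the listing above.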
  6. 04 Sep, 2022 5 commits
    • Linux 6.0-rc4 · 7e18e42e
      Linus Torvalds authored
    • Merge tag 'powerpc-6.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 59954972
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - Fix handling of PCI domains in /proc on 32-bit systems using the
         recently added support for numbering buses from zero for each domain.
      
       - A fix and a revert for some changes to use READ/WRITE_ONCE() which
         caused problems with KASAN enabled due to sanitisation calls being
         introduced in low-level paths that can't cope with it.
      
       - Fix build errors on 32-bit caused by the syscall table being
         misaligned sometimes.
      
       - Two fixes to get IBM Cell native machines booting again, which had
         bit-rotted while my QS22 was temporarily out of action.
      
       - Fix the papr_scm driver to not assume the order of events returned by
         the hypervisor is stable, and a related compile fix.
      
      Thanks to Aneesh Kumar K.V, Christophe Leroy, Jordan Niethe, Kajol Jain,
      Masahiro Yamada, Nathan Chancellor, Pali Rohár, Vaibhav Jain, and Zhouyi
      Zhou.
      
      * tag 'powerpc-6.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/papr_scm: Ensure rc is always initialized in papr_scm_pmu_register()
        Revert "powerpc/irq: Don't open code irq_soft_mask helpers"
        powerpc: Fix hard_irq_disable() with sanitizer
        powerpc/rtas: Fix RTAS MSR[HV] handling for Cell
        Revert "powerpc: Remove unused FW_FEATURE_NATIVE references"
        powerpc: align syscall table for ppc32
        powerpc/pci: Enable PCI domains in /proc when PCI bus numbers are not unique
        powerpc/papr_scm: Fix nvdimm event mappings
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 685ed983
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "s390:
      
         - PCI interpretation compile fixes
      
        RISC-V:
      
         - fix unused variable warnings in vcpu_timer.c
      
         - move extern sbi_ext declarations to a header
      
        x86:
      
         - check validity of argument to KVM_SET_MP_STATE
      
         - use guest's global_ctrl to completely disable guest PEBS
      
         - fix a memory leak on memory allocation failure
      
         - mask off unsupported and unknown bits of IA32_ARCH_CAPABILITIES
      
         - fix build failure with Clang integrated assembler
      
         - fix MSR interception
      
         - always flush TLBs when enabling dirty logging"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: check validity of argument to KVM_SET_MP_STATE
        perf/x86/core: Completely disable guest PEBS via guest's global_ctrl
        KVM: x86: fix memoryleak in kvm_arch_vcpu_create()
        KVM: x86: Mask off unsupported and unknown bits of IA32_ARCH_CAPABILITIES
        KVM: s390: pci: Hook to access KVM lowlevel from VFIO
        riscv: kvm: move extern sbi_ext declarations to a header
        riscv: kvm: vcpu_timer: fix unused variable warnings
        KVM: selftests: Fix ambiguous mov in KVM_ASM_SAFE()
        KVM: selftests: Fix KVM_EXCEPTION_MAGIC build with Clang
        KVM: VMX: Heed the 'msr' argument in msr_write_intercepted()
        kvm: x86: mmu: Always flush TLBs when enabling dirty logging
        kvm: x86: mmu: Drop the need_remote_flush() function
    • Makefile.extrawarn: re-enable -Wformat for clang; take 2 · b0839b28
      Nick Desaulniers authored
      -Wformat was recently re-enabled for builds with clang, then quickly
      re-disabled, due to concerns stemming from the frequency of default
      argument promotion related warning instances.
      
      commit 258fafcd ("Makefile.extrawarn: re-enable -Wformat for clang")
      commit 21f9c8a1 ("Revert "Makefile.extrawarn: re-enable -Wformat for clang"")
      
      ISO WG14 has ratified N2562 to address default argument promotion
      explicitly for printf, as part of the upcoming ISO C2X standard.
      
      The behavior of clang was changed in clang-16 to not warn for the cited
      cases in all language modes.
      
      Add a version check, so that users of clang-16 now get the full effect
      of -Wformat. For older clang versions, re-enable flags under the
      -Wformat group so that users still get some useful checks related to
      format strings, without noisy default argument promotion warnings. I
      intentionally omitted -Wformat-y2k and -Wformat-security from being
      re-enabled, which are also part of -Wformat in clang-16.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/378
      Link: https://github.com/llvm/llvm-project/issues/57102
      Link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2562.pdf
      Suggested-by: Justin Stitt <jstitt007@gmail.com>
      Suggested-by: Nathan Chancellor <nathan@kernel.org>
      Suggested-by: Youngmin Nam <youngmin.nam@samsung.com>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'gpio-fixes-for-v6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux · 7726d4c3
      Linus Torvalds authored
      Pull gpio fixes from Bartosz Golaszewski:
       "A set of fixes from the GPIO subsystem.
      
        Most are small driver fixes except the realtek-otto driver patch which
        is pretty big but addresses a significant flaw that can cause the CPU
        to stay infinitely busy on uncleared ISR on some platforms.
      
        Summary:
         - MAINTAINERS update
         - fix resource leaks in gpio-mockup and gpio-pxa
         - add missing locking in gpio-pca953x
         - use 32-bit I/O in gpio-realtek-otto
         - make irq_chip structures immutable in four more drivers"
      
      * tag 'gpio-fixes-for-v6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
        gpio: ws16c48: Make irq_chip immutable
        gpio: 104-idio-16: Make irq_chip immutable
        gpio: 104-idi-48: Make irq_chip immutable
        gpio: 104-dio-48e: Make irq_chip immutable
        gpio: realtek-otto: switch to 32-bit I/O
        gpio: pca953x: Add mutex_lock for regcache sync in PM
        gpio: mockup: remove gpio debugfs when remove device
        gpio: pxa: use devres for the clock struct
        MAINTAINERS: rectify entry for XILINX GPIO DRIVER
  7. 03 Sep, 2022 17 commits