1. 27 Sep, 2022 4 commits
    • bpf: Check flags for branch stack in bpf_read_branch_records helper · cce6a2d7
      Jiri Olsa authored
      A recent commit [1] changed the indication of branch stack data from
      the br_stack pointer to sample_flags in the perf_sample_data struct.
      
      We need to check sample_flags for the PERF_SAMPLE_BRANCH_STACK bit
      before treating the branch stack data as valid.
      
      [1] a9a931e2 ("perf: Use sample_flags for branch stack")
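      
      A minimal sketch of the intended check (helper name hypothetical; the
      actual fix lives in bpf_read_branch_records()):
      
          /* Branch records are only usable if the PMU flagged them. */
          static bool branch_stack_valid(const struct perf_sample_data *data)
          {
                  return data->sample_flags & PERF_SAMPLE_BRANCH_STACK;
          }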
      
      Fixes: a9a931e2 ("perf: Use sample_flags for branch stack")
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
      Link: https://lore.kernel.org/r/20220927203259.590950-1-jolsa@kernel.org
    • perf, hw_breakpoint: Fix use-after-free if perf_event_open() fails · 4674ffe2
      Marco Elver authored
      Local testing revealed that we can trigger a use-after-free during
      rhashtable lookup as follows:
      
       | BUG: KASAN: use-after-free in memcmp lib/string.c:757
       | Read of size 8 at addr ffff888107544dc0 by task perf-rhltable-n/1293
       |
       | CPU: 0 PID: 1293 Comm: perf-rhltable-n Not tainted 6.0.0-rc3-00014-g85260862789c #46
       | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
       | Call Trace:
       |  <TASK>
       |  memcmp			lib/string.c:757
       |  rhashtable_compare		include/linux/rhashtable.h:577 [inline]
       |  __rhashtable_lookup		include/linux/rhashtable.h:602 [inline]
       |  rhltable_lookup		include/linux/rhashtable.h:688 [inline]
       |  task_bp_pinned		kernel/events/hw_breakpoint.c:324
       |  toggle_bp_slot		kernel/events/hw_breakpoint.c:462
       |  __release_bp_slot		kernel/events/hw_breakpoint.c:631 [inline]
       |  release_bp_slot		kernel/events/hw_breakpoint.c:639
       |  register_perf_hw_breakpoint	kernel/events/hw_breakpoint.c:742
       |  hw_breakpoint_event_init	kernel/events/hw_breakpoint.c:976
       |  perf_try_init_event		kernel/events/core.c:11261
       |  perf_init_event		kernel/events/core.c:11325 [inline]
       |  perf_event_alloc		kernel/events/core.c:11619
       |  __do_sys_perf_event_open	kernel/events/core.c:12157
       |  do_syscall_x64 		arch/x86/entry/common.c:50 [inline]
       |  do_syscall_64		arch/x86/entry/common.c:80
       |  entry_SYSCALL_64_after_hwframe
       |  </TASK>
       |
       | Allocated by task 1292:
       |  perf_event_alloc		kernel/events/core.c:11505
       |  __do_sys_perf_event_open	kernel/events/core.c:12157
       |  do_syscall_x64		arch/x86/entry/common.c:50 [inline]
       |  do_syscall_64		arch/x86/entry/common.c:80
       |  entry_SYSCALL_64_after_hwframe
       |
       | Freed by task 1292:
       |  perf_event_alloc		kernel/events/core.c:11716
       |  __do_sys_perf_event_open	kernel/events/core.c:12157
       |  do_syscall_x64		arch/x86/entry/common.c:50 [inline]
       |  do_syscall_64		arch/x86/entry/common.c:80
       |  entry_SYSCALL_64_after_hwframe
       |
       | The buggy address belongs to the object at ffff888107544c00
       |  which belongs to the cache perf_event of size 1352
       | The buggy address is located 448 bytes inside of
       |  1352-byte region [ffff888107544c00, ffff888107545148)
      
      This happens because the first perf_event_open() managed to reserve a
      HW breakpoint slot, but later failed for other reasons and returned.
      The second perf_event_open() ran concurrently and, during
      rhltable_lookup(), looked up an entry which was being freed: since
      rhltable_lookup() may run concurrently (under the RCU read lock) with
      rhltable_remove(), we may end up with a stale entry whose memory has
      already been freed by the time it is accessed.
      
      To fix, only free the failed perf_event after an RCU grace period. This
      allows subsystems that store references to an event to always access it
      concurrently under the RCU read lock, even if initialization will fail.
      
      Given that failure is unlikely and on a slow path, turning the
      immediate free into a call_rcu()-wrapped free does not affect
      performance elsewhere.
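      
      A hedged sketch of the call_rcu() pattern described above (close to,
      but not necessarily identical to, the actual patch; perf_event embeds
      a struct rcu_head for its normal free path):
      
          static void free_event_rcu(struct rcu_head *head)
          {
                  struct perf_event *event =
                          container_of(head, struct perf_event, rcu_head);
      
                  kfree(event);   /* runs only after all current RCU readers are done */
          }
      
          /* On the perf_event_alloc() error path, instead of an immediate free: */
          call_rcu(&event->rcu_head, free_event_rcu);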
      
      Fixes: 0370dc31 ("perf/hw_breakpoint: Optimize list of per-task breakpoints")
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220927172025.1636995-1-elver@google.com
    • perf: Use sample_flags for raw_data · 838d9bb6
      Namhyung Kim authored
      Use the new sample_flags to indicate whether the raw data field is
      filled by the PMU driver.  Although it could be checked against NULL,
      follow the same rule as the other fields.
      
      Remove the raw field from perf_sample_data_init() to minimize the
      number of cache lines touched.
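      
      A hedged sketch of the driver-side pattern (helper name hypothetical):
      
          static void pmu_set_raw_data(struct perf_sample_data *data,
                                       struct perf_raw_record *raw)
          {
                  data->raw = raw;
                  data->sample_flags |= PERF_SAMPLE_RAW;
          }
      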
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220921220032.2858517-2-namhyung@kernel.org
    • perf: Use sample_flags for addr · 7b084630
      Namhyung Kim authored
      Use the new sample_flags to indicate whether the addr field is filled
      by the PMU driver.  As most PMU drivers pass 0, set the flag only for
      a non-zero value.  And use 0 in perf_output_sample() if the field was
      not filled.
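      
      A hedged sketch of both sides of the convention (helper names
      hypothetical):
      
          static void pmu_set_addr(struct perf_sample_data *data, u64 addr)
          {
                  if (addr) {     /* most drivers pass 0: leave the flag clear */
                          data->addr = addr;
                          data->sample_flags |= PERF_SAMPLE_ADDR;
                  }
          }
      
          static u64 sample_addr_or_zero(const struct perf_sample_data *data)
          {
                  /* output side: fall back to 0 when the PMU never filled it */
                  return (data->sample_flags & PERF_SAMPLE_ADDR) ? data->addr : 0;
          }
      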
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220921220032.2858517-1-namhyung@kernel.org
  2. 21 Sep, 2022 1 commit
    • perf/core: Convert snprintf() to scnprintf() · dca6344d
      Jules Irenge authored
      Coccinelle reports a warning:
      
          WARNING: use scnprintf or sprintf
      
      This LWN article explains the rationale for this change:
      
          https://lwn.net/Articles/69419/
      
      I.e., snprintf() returns what *would* be the resulting length, while
      scnprintf() returns the actual length written.
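      
      A hedged illustration of the difference with an 8-byte buffer:
      
          char buf[8];
          int n;
      
          n = snprintf(buf, sizeof(buf), "0123456789");  /* n == 10: would-be length */
          n = scnprintf(buf, sizeof(buf), "0123456789"); /* n ==  7: bytes actually written */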
      
      Adding to that, there has also been a slow migration from snprintf()
      to scnprintf(); here's the shift in usage over the past 3.5 years, in
      all fs/ files:
      
                               v5.0    v6.0-rc6
         --------------------------------------
         snprintf() uses:        63         213
         scnprintf() uses:      374         186
      
      No intended change in behavior.
      
      [ mingo: Improved the changelog & reviewed the usage sites. ]
      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 13 Sep, 2022 3 commits
  4. 07 Sep, 2022 14 commits
  5. 06 Sep, 2022 6 commits
  6. 30 Aug, 2022 12 commits
    • perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets · ecdfb889
      Marco Elver authored
      We can still see that a majority of the time is spent hashing task pointers:
      
          ...
          16.98%  [kernel]       [k] rhashtable_jhash2
          ...
      
      Doing the bookkeeping in toggle_bp_slot() is currently O(#cpus),
      calling task_bp_pinned() for each CPU, even if task_bp_pinned() is
      CPU-independent. The reason for this is to update the per-CPU
      'tsk_pinned' histogram.
      
      To optimize the CPU-independent case to O(1), keep a separate
      CPU-independent 'tsk_pinned_all' histogram.
      
      The major source of complexity is the transitions between "all
      CPU-independent task breakpoints" and "mixed CPU-independent and
      CPU-dependent task breakpoints". The code comments list all cases that
      require handling.
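      
      A hedged sketch of the bookkeeping change (helper and variable names
      illustrative):
      
          if (bp->cpu < 0) {
                  /* CPU-independent task target: account once, not per CPU. */
                  bp_slots_histogram_add(&tsk_pinned_all[type], old_pinned, weight);
          } else {
                  bp_slots_histogram_add(&get_bp_info(bp->cpu, type)->tsk_pinned,
                                         old_pinned, weight);
          }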
      
      After this optimization:
      
       | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
       |      Total time: 1.758 [sec]
       |
       |       34.336621 usecs/op
       |     4395.087500 usecs/op/cpu
      
          38.08%  [kernel]       [k] queued_spin_lock_slowpath
          10.81%  [kernel]       [k] smp_cfm_core_cond
           3.01%  [kernel]       [k] update_sg_lb_stats
           2.58%  [kernel]       [k] osq_lock
           2.57%  [kernel]       [k] llist_reverse_order
           1.45%  [kernel]       [k] find_next_bit
           1.21%  [kernel]       [k] flush_tlb_func_common
           1.01%  [kernel]       [k] arch_install_hw_breakpoint
      
      Showing that the time spent hashing keys has become insignificant.
      
      With the given benchmark parameters, that's an improvement of 12%
      compared with the old O(#cpus) version.
      
      And finally, using the less aggressive parameters from the preceding
      changes, we now observe:
      
       | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
       |      Total time: 0.067 [sec]
       |
       |       35.292187 usecs/op
       |     2258.700000 usecs/op/cpu
      
      Which is an improvement of 12% compared to without the histogram
      optimizations (baseline is 40 usecs/op). This is now on par with the
      theoretical ideal (constraints disabled), and only 12% slower than no
      breakpoints at all.
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-15-elver@google.com
    • perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets · 9b1933b8
      Marco Elver authored
      Running the perf benchmark with (note: more aggressive parameters vs.
      preceding changes, but same 256-CPU host):
      
       | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
       |      Total time: 1.989 [sec]
       |
       |       38.854160 usecs/op
       |     4973.332500 usecs/op/cpu
      
          20.43%  [kernel]       [k] queued_spin_lock_slowpath
          18.75%  [kernel]       [k] osq_lock
          16.98%  [kernel]       [k] rhashtable_jhash2
           8.34%  [kernel]       [k] task_bp_pinned
           4.23%  [kernel]       [k] smp_cfm_core_cond
           3.65%  [kernel]       [k] bcmp
           2.83%  [kernel]       [k] toggle_bp_slot
           1.87%  [kernel]       [k] find_next_bit
           1.49%  [kernel]       [k] __reserve_bp_slot
      
      We can see that a majority of the time is now spent hashing task
      pointers to index into task_bps_ht in task_bp_pinned().
      
      Computing max_bp_pinned_slots() for CPU-independent task targets is
      currently O(#cpus): it calls task_bp_pinned() for each CPU, even if
      the result of task_bp_pinned() is CPU-independent.
      
      The loop in max_bp_pinned_slots() wants to compute the maximum slots
      across all CPUs. If task_bp_pinned() is CPU-independent, we can do so by
      obtaining the max slots across all CPUs and adding task_bp_pinned().
      
      To do so in O(1), use a bp_slots_histogram for CPU-pinned slots.
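      
      A hedged sketch of the O(1) computation (helper and variable names
      illustrative):
      
          /*
           * Instead of calling task_bp_pinned() once per possible CPU, a
           * CPU-independent task target adds its own count to the maximum
           * of the CPU-pinned histogram:
           */
          pinned = bp_slots_histogram_max(&cpu_pinned[type], type) +
                   task_bp_pinned(-1 /* any CPU */, bp, type);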
      
      After this optimization:
      
       | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
       |      Total time: 1.930 [sec]
       |
       |       37.697832 usecs/op
       |     4825.322500 usecs/op/cpu
      
          19.13%  [kernel]       [k] queued_spin_lock_slowpath
          18.21%  [kernel]       [k] rhashtable_jhash2
          15.46%  [kernel]       [k] osq_lock
           6.27%  [kernel]       [k] toggle_bp_slot
           5.91%  [kernel]       [k] task_bp_pinned
           5.05%  [kernel]       [k] smp_cfm_core_cond
           1.78%  [kernel]       [k] update_sg_lb_stats
           1.36%  [kernel]       [k] llist_reverse_order
           1.34%  [kernel]       [k] find_next_bit
           1.19%  [kernel]       [k] bcmp
      
      Suggesting that time spent in task_bp_pinned() has been reduced.
      However, we're still hashing too much, which will be addressed in the
      subsequent change.
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-14-elver@google.com
    • perf/hw_breakpoint: Introduce bp_slots_histogram · 16db2839
      Marco Elver authored
      Factor out the existing `atomic_t count[N]` into its own struct called
      'bp_slots_histogram', to generalize it and make its intent clearer in
      preparation for reusing it elsewhere. The basic idea of bucketing
      "total uses of N slots" resembles a histogram, so calling it such
      seems most intuitive.
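      
      A hedged sketch of the factored-out type and a typical query (details
      simplified, names illustrative):
      
          struct bp_slots_histogram {
                  atomic_t *count;        /* count[n]: users pinning n+1 slots */
          };
      
          static int bp_slots_histogram_max(struct bp_slots_histogram *hist,
                                            int nr_slots)
          {
                  for (int i = nr_slots - 1; i >= 0; i--) {
                          if (atomic_read(&hist->count[i]) > 0)
                                  return i + 1;
                  }
                  return 0;
          }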
      
      No functional change.
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-13-elver@google.com
    • perf/hw_breakpoint: Reduce contention with large number of tasks · 0912037f
      Marco Elver authored
      While optimizing task_bp_pinned()'s runtime complexity to O(1) on
      average helps reduce time spent in the critical section, we still suffer
      due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows
      that now contention is the biggest issue:
      
          95.93%  [kernel]       [k] osq_lock
           0.70%  [kernel]       [k] mutex_spin_on_owner
           0.22%  [kernel]       [k] smp_cfm_core_cond
           0.18%  [kernel]       [k] task_bp_pinned
           0.18%  [kernel]       [k] rhashtable_jhash2
           0.15%  [kernel]       [k] queued_spin_lock_slowpath
      
      when running the breakpoint benchmark (on a system with 256 CPUs) with:
      
       | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
       |      Total time: 0.207 [sec]
       |
       |      108.267188 usecs/op
       |     6929.100000 usecs/op/cpu
      
      The main concern for synchronizing the breakpoint constraints data is
      that a consistent snapshot of the per-CPU and per-task data is observed.
      
      The access pattern is as follows:
      
       1. If the target is a task: the task's pinned breakpoints are counted,
          checked for space, and then appended to; only bp_cpuinfo::cpu_pinned
          is used to check for conflicts with CPU-only breakpoints;
          bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise
          unused.
      
       2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along
          with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is
          incremented. No per-task breakpoints are checked.
      
      Since rhltable safely synchronizes insertions/deletions, we can allow
      concurrency as follows:
      
       1. If the target is a task: independent tasks may update and check the
          constraints concurrently, but same-task target calls need to be
          serialized; since bp_cpuinfo::tsk_pinned is only updated, but not
          checked, these modifications can happen concurrently by switching
          tsk_pinned to atomic_t.
      
       2. If the target is a CPU: access to the per-CPU constraints needs to
          be serialized with other CPU-target and task-target callers (to
          stabilize the bp_cpuinfo::tsk_pinned snapshot).
      
      We can allow the above concurrency by introducing a per-CPU
      constraints data reader-writer lock (bp_cpuinfo_sem) and per-task
      mutexes (reusing task_struct::perf_event_mutex), as sketched after the
      list below:
      
        1. If the target is a task: acquires perf_event_mutex, and acquires
           bp_cpuinfo_sem as a reader. The choice of percpu-rwsem minimizes
           contention in the presence of many read-lock but few write-lock
           acquisitions: we assume many orders of magnitude more task target
           breakpoints creations/destructions than CPU target breakpoints.
      
        2. If the target is a CPU: acquires bp_cpuinfo_sem as a writer.
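      
      A hedged sketch of this locking scheme (close to, but not necessarily
      identical to, the actual code):
      
          static DEFINE_STATIC_PERCPU_RWSEM(bp_cpuinfo_sem);
      
          static void bp_constraints_lock(struct perf_event *bp)
          {
                  if (bp->hw.target) {            /* task target */
                          mutex_lock(&bp->hw.target->perf_event_mutex);
                          percpu_down_read(&bp_cpuinfo_sem);
                  } else {                        /* CPU target */
                          percpu_down_write(&bp_cpuinfo_sem);
                  }
          }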
      
      With these changes, contention with thousands of tasks is reduced to the
      point where waiting on locking no longer dominates the profile:
      
       | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
       |      Total time: 0.077 [sec]
       |
       |       40.201563 usecs/op
       |     2572.900000 usecs/op/cpu
      
          21.54%  [kernel]       [k] task_bp_pinned
          20.18%  [kernel]       [k] rhashtable_jhash2
           6.81%  [kernel]       [k] toggle_bp_slot
           5.47%  [kernel]       [k] queued_spin_lock_slowpath
           3.75%  [kernel]       [k] smp_cfm_core_cond
           3.48%  [kernel]       [k] bcmp
      
      On this particular setup that's a speedup of 2.7x.
      
      We're also getting closer to the theoretical ideal performance through
      optimizations in hw_breakpoint.c -- constraints accounting disabled:
      
       | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
       |      Total time: 0.067 [sec]
       |
       |       35.286458 usecs/op
       |     2258.333333 usecs/op/cpu
      
      Which means the current implementation is ~12% slower than the
      theoretical ideal.
      
      For reference, performance without any breakpoints:
      
       | $> perf bench -r 30 breakpoint thread -b 0 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 0 breakpoints and 64 parallelism
       |      Total time: 0.060 [sec]
       |
       |       31.365625 usecs/op
       |     2007.400000 usecs/op/cpu
      
      On a system with 256 CPUs, the theoretical ideal is only ~12% slower
      than no breakpoints at all; the current implementation is ~28% slower.
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-12-elver@google.com
    • locking/percpu-rwsem: Add percpu_is_write_locked() and percpu_is_read_locked() · 01fe8a3f
      Marco Elver authored
      Implement simple accessors to probe percpu-rwsem's locked state:
      percpu_is_write_locked(), percpu_is_read_locked().
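      
      A hedged usage sketch (assertion placement and the bp_cpuinfo_sem name
      are illustrative):
      
          /* e.g. from code that expects the constraints lock to be held: */
          WARN_ON_ONCE(!percpu_is_write_locked(&bp_cpuinfo_sem) &&
                       !percpu_is_read_locked(&bp_cpuinfo_sem));
      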
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-11-elver@google.com
    • powerpc/hw_breakpoint: Avoid relying on caller synchronization · f95e5a3d
      Marco Elver authored
      Internal data structures (cpu_bps, task_bps) of powerpc's hw_breakpoint
      implementation have relied on nr_bp_mutex serializing access to them.
      
      Before overhauling synchronization of kernel/events/hw_breakpoint.c,
      introduce 2 spinlocks to synchronize cpu_bps and task_bps respectively,
      thus avoiding reliance on callers synchronizing powerpc's hw_breakpoint.
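      
      A hedged sketch of the added serialization (close to the description
      above, not necessarily the exact patch):
      
          static DEFINE_SPINLOCK(cpu_bps_lock);
          static DEFINE_SPINLOCK(task_bps_lock);
      
          /* e.g. when adding or removing a per-task breakpoint entry: */
          spin_lock(&task_bps_lock);
          /* ... update task_bps ... */
          spin_unlock(&task_bps_lock);
      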
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-10-elver@google.com
    • perf/hw_breakpoint: Remove useless code related to flexible breakpoints · 24198ad3
      Marco Elver authored
      Flexible breakpoints have never been implemented, with
      bp_cpuinfo::flexible always being 0. Unfortunately, they still occupy
      4 bytes in each bp_cpuinfo and bp_busy_slots, and the max flexible
      count is still computed in fetch_bp_busy_slots().
      
      This again causes suboptimal code generation, even though we always
      know that `!!slots.flexible` will be 0.
      
      Just get rid of the flexible "placeholder" and remove all real code
      related to it. Make a note in the comment describing the constraints
      algorithm, but don't remove flexible breakpoints from the algorithm
      description itself, so that if flexible breakpoints need supporting in
      the future, it should be trivial to revive them (along with reverting
      this change).
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-9-elver@google.com
    • perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable · 9caf87be
      Marco Elver authored
      Due to being a __weak function, hw_breakpoint_weight() will cause the
      compiler to always emit a call to it. This generates unnecessarily bad
      code (register spills etc.) for no good reason; in fact it appears in
      profiles of `perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`:
      
          ...
          0.70%  [kernel]       [k] hw_breakpoint_weight
          ...
      
      While that is a small percentage, no architecture defines its own
      hw_breakpoint_weight(), nor are there users outside hw_breakpoint.c,
      which makes its current __weak definition a poor choice.
      
      Change hw_breakpoint_weight()'s definition to follow a similar protocol
      to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines
      hw_breakpoint_weight(), we'll use it instead.
      
      The result is that it is inlined and no longer shows up in profiles.
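      
      A hedged sketch of the protocol (the arch header may override the
      generic definition):
      
          /* kernel/events/hw_breakpoint.c */
          #ifndef hw_breakpoint_weight
          static inline int hw_breakpoint_weight(struct perf_event *bp)
          {
                  return 1;
          }
          #endif
      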
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-8-elver@google.com
    • perf/hw_breakpoint: Optimize constant number of breakpoint slots · be3f1525
      Marco Elver authored
      Optimize internal hw_breakpoint state if the architecture's number of
      breakpoint slots is constant. This avoids several kmalloc() calls and
      potentially unnecessary failures if the allocations fail, as well as
      subtly improves code generation and cache locality.
      
      The protocol is that if an architecture defines hw_breakpoint_slots via
      the preprocessor, it must be constant and the same for all types.
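      
      A hedged sketch of the protocol (helper name illustrative):
      
          #ifdef hw_breakpoint_slots
          /* Arch provides a constant: arrays can be sized at compile time. */
          # define hw_breakpoint_slots_cached(type)  hw_breakpoint_slots(type)
          #else
          /* Otherwise query once at init and cache the result. */
          static int __nr_bp_slots[TYPE_MAX] __ro_after_init;
          # define hw_breakpoint_slots_cached(type)  __nr_bp_slots[type]
          #endif
      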
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-7-elver@google.com
    • perf/hw_breakpoint: Mark data __ro_after_init · db5f6f85
      Marco Elver authored
      Mark read-only data after initialization as __ro_after_init.
      
      While we are here, turn 'constraints_initialized' into a bool.
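      
      A hedged sketch of the annotation:
      
          /* Written only during early init; read-only afterwards. */
          static bool constraints_initialized __ro_after_init;
      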
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-6-elver@google.com
    • perf/hw_breakpoint: Optimize list of per-task breakpoints · 0370dc31
      Marco Elver authored
      On a machine with 256 CPUs, running the recently added perf breakpoint
      benchmark results in:
      
       | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
       |      Total time: 236.418 [sec]
       |
       |   123134.794271 usecs/op
       |  7880626.833333 usecs/op/cpu
      
      The benchmark tests inherited breakpoint perf events across many
      threads.
      
      Looking at a perf profile, we can see that the majority of the time is
      spent in various hw_breakpoint.c functions, which execute within the
      'nr_bp_mutex' critical sections; this in turn results in contention on
      that mutex as well:
      
          37.27%  [kernel]       [k] osq_lock
          34.92%  [kernel]       [k] mutex_spin_on_owner
          12.15%  [kernel]       [k] toggle_bp_slot
          11.90%  [kernel]       [k] __reserve_bp_slot
      
      The culprit here is task_bp_pinned(), which has a runtime complexity of
      O(#tasks) due to storing all task breakpoints in the same list and
      iterating through that list looking for a matching task. Clearly, this
      does not scale to thousands of tasks.
      
      Instead, make use of the "rhashtable" variant "rhltable" which stores
      multiple items with the same key in a list. This results in average
      runtime complexity of O(1) for task_bp_pinned().
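      
      A hedged sketch of the rhltable keying (field names illustrative):
      
          static struct rhltable task_bps_ht;
      
          static const struct rhashtable_params task_bps_ht_params = {
                  .head_offset = offsetof(struct hw_perf_event, bp_list),
                  .key_offset  = offsetof(struct hw_perf_event, target),
                  .key_len     = sizeof_field(struct hw_perf_event, target),
                  .automatic_shrinking = true,
          };
      
          /*
           * task_bp_pinned() then walks only the target task's bucket list,
           * obtained via rhltable_lookup(&task_bps_ht, &bp->hw.target,
           * task_bps_ht_params), instead of a global list of all breakpoints.
           */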
      
      With the optimization, the benchmark shows:
      
       | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
       | # Running 'breakpoint/thread' benchmark:
       | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
       |      Total time: 0.208 [sec]
       |
       |      108.422396 usecs/op
       |     6939.033333 usecs/op/cpu
      
      On this particular setup that's a speedup of ~1135x.
      
      While one option would be to make task_struct a breakpoint list node,
      this would only further bloat task_struct for infrequently used data.
      Furthermore, after all optimizations in this series, there's no evidence
      it would result in better performance: later optimizations make the time
      spent looking up entries in the hash table negligible (we'll reach the
      theoretical ideal performance i.e. no constraints).
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-5-elver@google.com
    • perf/hw_breakpoint: Clean up headers · 089cdcb0
      Marco Elver authored
      Clean up headers:
      
       - Remove unused <linux/kallsyms.h>
      
       - Remove unused <linux/kprobes.h>
      
       - Remove unused <linux/module.h>
      
       - Remove unused <linux/smp.h>
      
       - Add <linux/export.h> for EXPORT_SYMBOL_GPL().
      
       - Add <linux/mutex.h> for mutex.
      
       - Sort alphabetically.
      
       - Move <linux/hw_breakpoint.h> to top to test it compiles on its own.
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Link: https://lore.kernel.org/r/20220829124719.675715-4-elver@google.com