- 30 Aug, 2022 11 commits
-
-
Marco Elver authored
While optimizing task_bp_pinned()'s runtime complexity to O(1) on average helps reduce time spent in the critical section, we still suffer due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows that now contention is the biggest issue: 95.93% [kernel] [k] osq_lock 0.70% [kernel] [k] mutex_spin_on_owner 0.22% [kernel] [k] smp_cfm_core_cond 0.18% [kernel] [k] task_bp_pinned 0.18% [kernel] [k] rhashtable_jhash2 0.15% [kernel] [k] queued_spin_lock_slowpath when running the breakpoint benchmark with (system with 256 CPUs): | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 | # Running 'breakpoint/thread' benchmark: | # Created/joined 30 threads with 4 breakpoints and 64 parallelism | Total time: 0.207 [sec] | | 108.267188 usecs/op | 6929.100000 usecs/op/cpu The main concern for synchronizing the breakpoint constraints data is that a consistent snapshot of the per-CPU and per-task data is observed. The access pattern is as follows: 1. If the target is a task: the task's pinned breakpoints are counted, checked for space, and then appended to; only bp_cpuinfo::cpu_pinned is used to check for conflicts with CPU-only breakpoints; bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise unused. 2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is incremented. No per-task breakpoints are checked. Since rhltable safely synchronizes insertions/deletions, we can allow concurrency as follows: 1. If the target is a task: independent tasks may update and check the constraints concurrently, but same-task target calls need to be serialized; since bp_cpuinfo::tsk_pinned is only updated, but not checked, these modifications can happen concurrently by switching tsk_pinned to atomic_t. 2. If the target is a CPU: access to the per-CPU constraints needs to be serialized with other CPU-target and task-target callers (to stabilize the bp_cpuinfo::tsk_pinned snapshot). We can allow the above concurrency by introducing a per-CPU constraints data reader-writer lock (bp_cpuinfo_sem), and per-task mutexes (reuses task_struct::perf_event_mutex): 1. If the target is a task: acquires perf_event_mutex, and acquires bp_cpuinfo_sem as a reader. The choice of percpu-rwsem minimizes contention in the presence of many read-lock but few write-lock acquisitions: we assume many orders of magnitude more task target breakpoints creations/destructions than CPU target breakpoints. 2. If the target is a CPU: acquires bp_cpuinfo_sem as a writer. With these changes, contention with thousands of tasks is reduced to the point where waiting on locking no longer dominates the profile: | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 | # Running 'breakpoint/thread' benchmark: | # Created/joined 30 threads with 4 breakpoints and 64 parallelism | Total time: 0.077 [sec] | | 40.201563 usecs/op | 2572.900000 usecs/op/cpu 21.54% [kernel] [k] task_bp_pinned 20.18% [kernel] [k] rhashtable_jhash2 6.81% [kernel] [k] toggle_bp_slot 5.47% [kernel] [k] queued_spin_lock_slowpath 3.75% [kernel] [k] smp_cfm_core_cond 3.48% [kernel] [k] bcmp On this particular setup that's a speedup of 2.7x. We're also getting closer to the theoretical ideal performance through optimizations in hw_breakpoint.c -- constraints accounting disabled: | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 | # Running 'breakpoint/thread' benchmark: | # Created/joined 30 threads with 4 breakpoints and 64 parallelism | Total time: 0.067 [sec] | | 35.286458 usecs/op | 2258.333333 usecs/op/cpu Which means the current implementation is ~12% slower than the theoretical ideal. For reference, performance without any breakpoints: | $> bench -r 30 breakpoint thread -b 0 -p 64 -t 64 | # Running 'breakpoint/thread' benchmark: | # Created/joined 30 threads with 0 breakpoints and 64 parallelism | Total time: 0.060 [sec] | | 31.365625 usecs/op | 2007.400000 usecs/op/cpu On a system with 256 CPUs, the theoretical ideal is only ~12% slower than no breakpoints at all; the current implementation is ~28% slower. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-12-elver@google.com
-
Marco Elver authored
Implement simple accessors to probe percpu-rwsem's locked state: percpu_is_write_locked(), percpu_is_read_locked(). Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-11-elver@google.com
-
Marco Elver authored
Internal data structures (cpu_bps, task_bps) of powerpc's hw_breakpoint implementation have relied on nr_bp_mutex serializing access to them. Before overhauling synchronization of kernel/events/hw_breakpoint.c, introduce 2 spinlocks to synchronize cpu_bps and task_bps respectively, thus avoiding reliance on callers synchronizing powerpc's hw_breakpoint. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-10-elver@google.com
-
Marco Elver authored
Flexible breakpoints have never been implemented, with bp_cpuinfo::flexible always being 0. Unfortunately, they still occupy 4 bytes in each bp_cpuinfo and bp_busy_slots, as well as computing the max flexible count in fetch_bp_busy_slots(). This again causes suboptimal code generation, when we always know that `!!slots.flexible` will be 0. Just get rid of the flexible "placeholder" and remove all real code related to it. Make a note in the comment related to the constraints algorithm but don't remove them from the algorithm, so that if in future flexible breakpoints need supporting, it should be trivial to revive them (along with reverting this change). Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-9-elver@google.com
-
Marco Elver authored
Due to being a __weak function, hw_breakpoint_weight() will cause the compiler to always emit a call to it. This generates unnecessarily bad code (register spills etc.) for no good reason; in fact it appears in profiles of `perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`: ... 0.70% [kernel] [k] hw_breakpoint_weight ... While a small percentage, no architecture defines its own hw_breakpoint_weight() nor are there users outside hw_breakpoint.c, which makes the fact it is currently __weak a poor choice. Change hw_breakpoint_weight()'s definition to follow a similar protocol to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines hw_breakpoint_weight(), we'll use it instead. The result is that it is inlined and no longer shows up in profiles. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-8-elver@google.com
-
Marco Elver authored
Optimize internal hw_breakpoint state if the architecture's number of breakpoint slots is constant. This avoids several kmalloc() calls and potentially unnecessary failures if the allocations fail, as well as subtly improves code generation and cache locality. The protocol is that if an architecture defines hw_breakpoint_slots via the preprocessor, it must be constant and the same for all types. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-7-elver@google.com
-
Marco Elver authored
Mark read-only data after initialization as __ro_after_init. While we are here, turn 'constraints_initialized' into a bool. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-6-elver@google.com
-
Marco Elver authored
On a machine with 256 CPUs, running the recently added perf breakpoint benchmark results in: | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 | # Running 'breakpoint/thread' benchmark: | # Created/joined 30 threads with 4 breakpoints and 64 parallelism | Total time: 236.418 [sec] | | 123134.794271 usecs/op | 7880626.833333 usecs/op/cpu The benchmark tests inherited breakpoint perf events across many threads. Looking at a perf profile, we can see that the majority of the time is spent in various hw_breakpoint.c functions, which execute within the 'nr_bp_mutex' critical sections which then results in contention on that mutex as well: 37.27% [kernel] [k] osq_lock 34.92% [kernel] [k] mutex_spin_on_owner 12.15% [kernel] [k] toggle_bp_slot 11.90% [kernel] [k] __reserve_bp_slot The culprit here is task_bp_pinned(), which has a runtime complexity of O(#tasks) due to storing all task breakpoints in the same list and iterating through that list looking for a matching task. Clearly, this does not scale to thousands of tasks. Instead, make use of the "rhashtable" variant "rhltable" which stores multiple items with the same key in a list. This results in average runtime complexity of O(1) for task_bp_pinned(). With the optimization, the benchmark shows: | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 | # Running 'breakpoint/thread' benchmark: | # Created/joined 30 threads with 4 breakpoints and 64 parallelism | Total time: 0.208 [sec] | | 108.422396 usecs/op | 6939.033333 usecs/op/cpu On this particular setup that's a speedup of ~1135x. While one option would be to make task_struct a breakpoint list node, this would only further bloat task_struct for infrequently used data. Furthermore, after all optimizations in this series, there's no evidence it would result in better performance: later optimizations make the time spent looking up entries in the hash table negligible (we'll reach the theoretical ideal performance i.e. no constraints). Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-5-elver@google.com
-
Marco Elver authored
Clean up headers: - Remove unused <linux/kallsyms.h> - Remove unused <linux/kprobes.h> - Remove unused <linux/module.h> - Remove unused <linux/smp.h> - Add <linux/export.h> for EXPORT_SYMBOL_GPL(). - Add <linux/mutex.h> for mutex. - Sort alphabetically. - Move <linux/hw_breakpoint.h> to top to test it compiles on its own. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-4-elver@google.com
-
Marco Elver authored
Provide hw_breakpoint_is_used() to check if breakpoints are in use on the system. Use it in the KUnit test to verify the global state before and after a test case. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-3-elver@google.com
-
Marco Elver authored
Add KUnit test for hw_breakpoint constraints accounting, with various interesting mixes of breakpoint targets (some care was taken to catch interesting corner cases via bug-injection). The test cannot be built as a module because it requires access to hw_breakpoint_slots(), which is not inlinable or exported on all architectures. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-2-elver@google.com
-
- 29 Aug, 2022 4 commits
-
-
Anshuman Khandual authored
BRBE captured branch types will overflow perf_branch_entry.type and generic branch types in perf_branch_entry.new_type. So override each available arch specific branch type in the following manner to comprehensively process all reported branch types in BRBE. PERF_BR_ARM64_FIQ PERF_BR_NEW_ARCH_1 PERF_BR_ARM64_DEBUG_HALT PERF_BR_NEW_ARCH_2 PERF_BR_ARM64_DEBUG_EXIT PERF_BR_NEW_ARCH_3 PERF_BR_ARM64_DEBUG_INST PERF_BR_NEW_ARCH_4 PERF_BR_ARM64_DEBUG_DATA PERF_BR_NEW_ARCH_5 Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: James Clark <james.clark@arm.com> Link: https://lkml.kernel.org/r/20220824044822.70230-5-anshuman.khandual@arm.com
-
Anshuman Khandual authored
Platforms like arm64 could capture privilege level information for all the branch records. Hence this adds a new element in the struct branch_entry to record the privilege level information, which could be requested through a new event.attr.branch_sample_type based flag PERF_SAMPLE_BRANCH_PRIV_SAVE. This flag helps user choose whether privilege information is captured. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: James Clark <james.clark@arm.com> Link: https://lkml.kernel.org/r/20220824044822.70230-4-anshuman.khandual@arm.com
-
Anshuman Khandual authored
branch_entry.type now has ran out of space to accommodate more branch types classification. This will prevent perf branch stack implementation on arm64 (via BRBE) to capture all available branch types. Extending this bit field i.e branch_entry.type [4 bits] is not an option as it will break user space ABI both for little and big endian perf tools. Extend branch classification with a new field branch_entry.new_type via a new branch type PERF_BR_EXTEND_ABI in branch_entry.type. Perf tools which could decode PERF_BR_EXTEND_ABI, will then parse branch_entry.new_type as well. branch_entry.new_type is a 4 bit field which can hold upto 16 branch types. The first three branch types will hold various generic page faults followed by five architecture specific branch types, which can be overridden by the platform for specific use cases. These architecture specific branch types gets overridden on arm64 platform for BRBE implementation. New generic branch types - PERF_BR_NEW_FAULT_ALGN - PERF_BR_NEW_FAULT_DATA - PERF_BR_NEW_FAULT_INST New arch specific branch types - PERF_BR_NEW_ARCH_1 - PERF_BR_NEW_ARCH_2 - PERF_BR_NEW_ARCH_3 - PERF_BR_NEW_ARCH_4 - PERF_BR_NEW_ARCH_5 Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: James Clark <james.clark@arm.com> Link: https://lkml.kernel.org/r/20220824044822.70230-3-anshuman.khandual@arm.com
-
Anshuman Khandual authored
This expands generic branch type classification by adding two more entries there in i.e system error and not in transaction. This also updates the x86 implementation to process X86_BR_NO_TX records as appropriate. This changes branch types reported to user space on x86 platform but it should not be a problem. The possible scenarios and impacts are enumerated here. -------------------------------------------------------------------------- | kernel | perf tool | Impact | -------------------------------------------------------------------------- | old | old | Works as before | -------------------------------------------------------------------------- | old | new | PERF_BR_UNKNOWN is processed | -------------------------------------------------------------------------- | new | old | PERF_BR_NO_TX is blocked via old PERF_BR_MAX | -------------------------------------------------------------------------- | new | new | PERF_BR_NO_TX is recognized | -------------------------------------------------------------------------- When PERF_BR_NO_TX is blocked via old PERF_BR_MAX (new kernel with old perf tool) the user space might throw up an warning complaining about an unrecognized branch types being reported, but it's expected. PERF_BR_SERROR & PERF_BR_NO_TX branch types will be used for BRBE implementation on arm64 platform. PERF_BR_NO_TX complements 'abort' and 'in_tx' elements in perf_branch_entry which represent other transaction states for a given branch record. Because this completes the transaction state classification. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: James Clark <james.clark@arm.com> Link: https://lkml.kernel.org/r/20220824044822.70230-2-anshuman.khandual@arm.com
-
- 26 Aug, 2022 13 commits
-
-
Sandipan Das authored
Provide branch speculation information captured via AMD Last Branch Record Extension Version 2 (LbrExtV2) by setting the speculation info in branch records. The info is based on the "valid" and "spec" bits in the Branch To registers. Suggested-by: Stephane Eranian <eranian@google.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/ddc02f6320464cad0e3ff5bdb2314531568a91bc.1660211399.git.sandipan.das@amd.com
-
Sandipan Das authored
Add a new "spec" bitfield to branch entries for providing speculation information. This will be populated using hints provided by branch sampling features on supported hardware. The following cases are covered: * No branch speculation information is available * Branch is speculative but taken on the wrong path * Branch is non-speculative but taken on the correct path * Branch is speculative and taken on the correct path Suggested-by: Stephane Eranian <eranian@google.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/834088c302faf21c7b665031dd111f424e509a64.1660211399.git.sandipan.das@amd.com
-
Sandipan Das authored
AMD Last Branch Record Extension Version 2 (LbrExtV2) can report a branch from address that points to an instruction preceding the actual branch by several bytes due to branch fusion and further optimizations in Zen4 processors. In such cases, software should move forward sequentially in the instruction stream from the reported address and the address of the first branch encountered should be used instead. Hence, use the fusion-aware branch classifier to determine the correct branch type and get the offset for adjusting the branch from address. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/c324d2d0a9c3976da30b9563d09e50bfee0f264d.1660211399.git.sandipan.das@amd.com
-
Sandipan Das authored
With branch fusion and other optimizations, branch sampling hardware in some processors can report a branch from address that points to an instruction preceding the actual branch by several bytes. In such cases, the classifier cannot determine the branch type which leads to failures such as with the recently added test from commit b55878c9 ("perf test: Add test for branch stack sampling"). Branch information is also easier to consume and annotate if branch from addresses always point to branch instructions. Add a new variant of the branch classifier that can account for instruction fusion. If fusion is expected and the current branch from address does not point to a branch instruction, it attempts to find the first branch within the next (MAX_INSN_SIZE - 1) bytes and if found, additionally provides the offset between the reported branch from address and the address of the expected branch instruction. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/b6bb0abaa8a54c0b6d716344700ee11a1793d709.1660211399.git.sandipan.das@amd.com
-
Sandipan Das authored
With AMD Last Branch Record Extension Version 2 (LbrExtV2), it is necessary to process the branch records further as hardware filtering is not granular enough for identifying certain types of branches. E.g. to record system calls, one should record far branches. The filter captures both far calls and far returns but the irrelevant records are filtered out based on the branch type as seen by the branch classifier. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/e51de057517f77788abd393c832e8dea616d489c.1660211399.git.sandipan.das@amd.com
-
Sandipan Das authored
Commit 3e702ff6 ("perf/x86: Add LBR software filter support for Intel CPUs") introduces a software branch filter which complements the hardware branch filter and adds an x86 branch classifier. Move the branch classifier to arch/x86/events/ so that it can be utilized by other vendors for branch record filtering. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/bae5b95470d6bd49f40954bd379f414f5afcb965.1660211399.git.sandipan.das@amd.com
-
Sandipan Das authored
If AMD Last Branch Record Extension Version 2 (LbrExtV2) is detected, convert the requested branch filter (PERF_SAMPLE_BRANCH_* flags) to the corresponding hardware filter value and stash it in the event data when a branch stack is requested. The hardware filter value is also saved in per-CPU areas for use during event scheduling. Hardware filtering is provided by the LBR Branch Select register. It has bits which when set, suppress recording of the following types of branches: * CPL = 0 (Kernel only) * CPL > 0 (Userspace only) * Conditional Branches * Near Relative Calls * Near Indirect Calls * Near Returns * Near Indirect Jumps (excluding Near Indirect Calls and Near Returns) * Near Relative Jumps (excluding Near Relative Calls) * Far Branches Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/9336af5c9785b8e14c62220fc0e6cfb10ab97de3.1660211399.git.sandipan.das@amd.com
-
Sandipan Das authored
If AMD Last Branch Record Extension Version 2 (LbrExtV2) is detected, enable it alongside LBR Freeze on PMI when an event requests branch stack i.e. PERF_SAMPLE_BRANCH_STACK. Each branch record is represented by a pair of registers, LBR From and LBR To. The freeze feature prevents any updates to these registers once a PMC overflows. The contents remain unchanged until the freeze bit is cleared by the PMI handler. The branch records are read and copied to sample data before unfreezing. However, only valid entries are copied. There is no additional register to denote which of the register pairs represent the top of the stack (TOS) since internal register renaming always ensures that the first pair (i.e. index 0) is the one representing the most recent branch and so on. The LBR registers are per-thread resources and are cleared explicitly whenever a new task is scheduled in. There are no special implications on the contents of these registers when transitioning to deep C-states. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/d3b8500a3627a0d4d0259b005891ee248f248d91.1660211399.git.sandipan.das@amd.com
-
Sandipan Das authored
AMD Last Branch Record Extension Version 2 (LbrExtV2) is driven by Core PMC overflows. It records recently taken branches up to the moment when the PMC overflow occurs. Detect the feature during PMU initialization and set the branch stack depth using CPUID leaf 0x80000022 EBX. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/fc6e45378ada258f1bab79b0de6e05c393a8f1dd.1660211399.git.sandipan.das@amd.com
-
Sandipan Das authored
CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new performance monitoring features for AMD processors. Bit 1 of EAX indicates support for Last Branch Record Extension Version 2 (LbrExtV2) features. If found to be set during PMU initialization, the EBX bits of the same leaf can be used to determine the number of available LBR entries. For better utilization of feature words, LbrExtV2 is added as a scattered feature bit. [peterz: Rename to AMD_LBR_V2] Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/172d2b0df39306ed77221c45ee1aa62e8ae0548d.1660211399.git.sandipan.das@amd.com
-
Sandipan Das authored
AMD processors that are capable of recording branches support either Branch Sampling (BRS) or Last Branch Record (LBR). In preparation for adding Last Branch Record Extension Version 2 (LbrExtV2) support, introduce new static calls which act as gateways to call into the feature-dependent functions based on what is available on the processor. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/b75dbc32663cb395f0d701167e952c6a6b0445a3.1660211399.git.sandipan.das@amd.com
-
Sandipan Das authored
AMD processors that are capable of recording branches support either Branch Sampling (BRS) or Last Branch Record (LBR). In preparation for adding Last Branch Record Extension Version 2 (LbrExtV2) support, reuse the "branches" capability to advertise information about both BRS and LBR but make the "branch-brs" event exclusive to Family 19h processors that support BRS. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/ba4a4cde6db79b1c65c49834027bbdb8a915546b.1660211399.git.sandipan.das@amd.com
-
Sandipan Das authored
Move some of the Branch Sampling (BRS) specific functions out of the Core events sources and into the BRS sources in preparation for adding other mechanisms to record branches. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/b60283b57179475d18ee242d117c335c16733693.1660211399.git.sandipan.das@amd.com
-
- 22 Aug, 2022 1 commit
-
-
Linus Torvalds authored
-
- 21 Aug, 2022 11 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull irq fixes from Ingo Molnar: "Misc irqchip fixes: LoongArch driver fixes and a Hyper-V IOMMU fix" * tag 'irq-urgent-2022-08-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/loongson-liointc: Fix an error handling path in liointc_init() irqchip/loongarch: Fix irq_domain_alloc_fwnode() abuse irqchip/loongson-pch-pic: Move find_pch_pic() into CONFIG_ACPI irqchip/loongson-eiointc: Fix a build warning irqchip/loongson-eiointc: Fix irq affinity setting iommu/hyper-v: Use helper instead of directly accessing affinity
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull x86 kprobes fix from Ingo Molnar: "Fix a kprobes bug in JNG/JNLE emulation when a kprobe is installed at such instructions, possibly resulting in incorrect execution (the wrong branch taken)" * tag 'perf-urgent-2022-08-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kprobes: Fix JNG/JNLE emulation
-
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-traceLinus Torvalds authored
Pull tracing fixes from Steven Rostedt: "Various fixes for tracing: - Fix a return value of traceprobe_parse_event_name() - Fix NULL pointer dereference from failed ftrace enabling - Fix NULL pointer dereference when asking for registers from eprobes - Make eprobes consistent with kprobes/uprobes, filters and histograms" * tag 'trace-v6.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Have filter accept "common_cpu" to be consistent tracing/probes: Have kprobes and uprobes use $COMM too tracing/eprobes: Have event probes be consistent with kprobes and uprobes tracing/eprobes: Fix reading of string fields tracing/eprobes: Do not hardcode $comm as a string tracing/eprobes: Do not allow eprobes to use $stack, or % for regs ftrace: Fix NULL pointer dereference in is_ftrace_trampoline when ftrace is dead tracing/perf: Fix double put of trace event when init fails tracing: React to error return from traceprobe_parse_event_name()
-
Steven Rostedt (Google) authored
Make filtering consistent with histograms. As "cpu" can be a field of an event, allow for "common_cpu" to keep it from being confused with the "cpu" field of the event. Link: https://lkml.kernel.org/r/20220820134401.513062765@goodmis.org Link: https://lore.kernel.org/all/20220820220920.e42fa32b70505b1904f0a0ad@kernel.org/ Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: 1e3bac71 ("tracing/histogram: Rename "cpu" to "common_cpu"") Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
Both $comm and $COMM can be used to get current->comm in eprobes and the filtering and histogram logic. Make kprobes and uprobes consistent in this regard and allow both $comm and $COMM as well. Currently kprobes and uprobes only handle $comm, which is inconsistent with the other utilities, and can be confusing to users. Link: https://lkml.kernel.org/r/20220820134401.317014913@goodmis.org Link: https://lore.kernel.org/all/20220820220442.776e1ddaf8836e82edb34d01@kernel.org/ Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: 53305928 ("tracing: probeevent: Introduce new argument fetching code") Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
Currently, if a symbol "@" is attempted to be used with an event probe (eprobes), it will cause a NULL pointer dereference crash. Both kprobes and uprobes can reference data other than the main registers. Such as immediate address, symbols and the current task name. Have eprobes do the same thing. For "comm", if "comm" is used and the event being attached to does not have the "comm" field, then make it the "$comm" that kprobes has. This is consistent to the way histograms and filters work. Link: https://lkml.kernel.org/r/20220820134401.136924220@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: 7491e2c4 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
Currently when an event probe (eprobe) hooks to a string field, it does not display it as a string, but instead as a number. This makes the field rather useless. Handle the different kinds of strings, dynamic, static, relational/dynamic etc. Now when a string field is used, the ":string" type can be used to display it: echo "e:sw sched/sched_switch comm=$next_comm:string" > dynamic_events Link: https://lkml.kernel.org/r/20220820134400.959640191@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: 7491e2c4 ("tracing: Add a probe that attaches to trace events") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
The variable $comm is hard coded as a string, which is true for both kprobes and uprobes, but for event probes (eprobes) it is a field name. In most cases the "comm" field would be a string, but there's no guarantee of that fact. Do not assume that comm is a string. Not to mention, it currently forces comm fields to fault, as string processing for event probes is currently broken. Link: https://lkml.kernel.org/r/20220820134400.756152112@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: 7491e2c4 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
While playing with event probes (eprobes), I tried to see what would happen if I attempted to retrieve the instruction pointer (%rip) knowing that event probes do not use pt_regs. The result was: BUG: kernel NULL pointer dereference, address: 0000000000000024 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 1847 Comm: trace-cmd Not tainted 5.19.0-rc5-test+ #309 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 RIP: 0010:get_event_field.isra.0+0x0/0x50 Code: ff 48 c7 c7 c0 8f 74 a1 e8 3d 8b f5 ff e8 88 09 f6 ff 4c 89 e7 e8 50 6a 13 00 48 89 ef 5b 5d 41 5c 41 5d e9 42 6a 13 00 66 90 <48> 63 47 24 8b 57 2c 48 01 c6 8b 47 28 83 f8 02 74 0e 83 f8 04 74 RSP: 0018:ffff916c394bbaf0 EFLAGS: 00010086 RAX: ffff916c854041d8 RBX: ffff916c8d9fbf50 RCX: ffff916c255d2000 RDX: 0000000000000000 RSI: ffff916c255d2008 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffff916c3a2a0c08 R09: ffff916c394bbda8 R10: 0000000000000000 R11: 0000000000000000 R12: ffff916c854041d8 R13: ffff916c854041b0 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff916c9ea40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000024 CR3: 000000011b60a002 CR4: 00000000001706e0 Call Trace: <TASK> get_eprobe_size+0xb4/0x640 ? __mod_node_page_state+0x72/0xc0 __eprobe_trace_func+0x59/0x1a0 ? __mod_lruvec_page_state+0xaa/0x1b0 ? page_remove_file_rmap+0x14/0x230 ? page_remove_rmap+0xda/0x170 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x18f/0x240 trace_event_raw_event_sched_wakeup_template+0x7a/0xb0 try_to_wake_up+0x260/0x4c0 __wake_up_common+0x80/0x180 __wake_up_common_lock+0x7c/0xc0 do_notify_parent+0x1c9/0x2a0 exit_notify+0x1a9/0x220 do_exit+0x2ba/0x450 do_group_exit+0x2d/0x90 __x64_sys_exit_group+0x14/0x20 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Obviously this is not the desired result. Move the testing for TPARG_FL_TPOINT which is only used for event probes to the top of the "$" variable check, as all the other variables are not used for event probes. Also add a check in the register parsing "%" to fail if an event probe is used. Link: https://lkml.kernel.org/r/20220820134400.564426983@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: 7491e2c4 ("tracing: Add a probe that attaches to trace events") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-
Yang Jihong authored
ftrace_startup does not remove ops from ftrace_ops_list when ftrace_startup_enable fails: register_ftrace_function ftrace_startup __register_ftrace_function ... add_ftrace_ops(&ftrace_ops_list, ops) ... ... ftrace_startup_enable // if ftrace failed to modify, ftrace_disabled is set to 1 ... return 0 // ops is in the ftrace_ops_list. When ftrace_disabled = 1, unregister_ftrace_function simply returns without doing anything: unregister_ftrace_function ftrace_shutdown if (unlikely(ftrace_disabled)) return -ENODEV; // return here, __unregister_ftrace_function is not executed, // as a result, ops is still in the ftrace_ops_list __unregister_ftrace_function ... If ops is dynamically allocated, it will be free later, in this case, is_ftrace_trampoline accesses NULL pointer: is_ftrace_trampoline ftrace_ops_trampoline do_for_each_ftrace_op(op, ftrace_ops_list) // OOPS! op may be NULL! Syzkaller reports as follows: [ 1203.506103] BUG: kernel NULL pointer dereference, address: 000000000000010b [ 1203.508039] #PF: supervisor read access in kernel mode [ 1203.508798] #PF: error_code(0x0000) - not-present page [ 1203.509558] PGD 800000011660b067 P4D 800000011660b067 PUD 130fb8067 PMD 0 [ 1203.510560] Oops: 0000 [#1] SMP KASAN PTI [ 1203.511189] CPU: 6 PID: 29532 Comm: syz-executor.2 Tainted: G B W 5.10.0 #8 [ 1203.512324] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 1203.513895] RIP: 0010:is_ftrace_trampoline+0x26/0xb0 [ 1203.514644] Code: ff eb d3 90 41 55 41 54 49 89 fc 55 53 e8 f2 00 fd ff 48 8b 1d 3b 35 5d 03 e8 e6 00 fd ff 48 8d bb 90 00 00 00 e8 2a 81 26 00 <48> 8b ab 90 00 00 00 48 85 ed 74 1d e8 c9 00 fd ff 48 8d bb 98 00 [ 1203.518838] RSP: 0018:ffffc900012cf960 EFLAGS: 00010246 [ 1203.520092] RAX: 0000000000000000 RBX: 000000000000007b RCX: ffffffff8a331866 [ 1203.521469] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000000010b [ 1203.522583] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff8df18b07 [ 1203.523550] R10: fffffbfff1be3160 R11: 0000000000000001 R12: 0000000000478399 [ 1203.524596] R13: 0000000000000000 R14: ffff888145088000 R15: 0000000000000008 [ 1203.525634] FS: 00007f429f5f4700(0000) GS:ffff8881daf00000(0000) knlGS:0000000000000000 [ 1203.526801] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1203.527626] CR2: 000000000000010b CR3: 0000000170e1e001 CR4: 00000000003706e0 [ 1203.528611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1203.529605] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Therefore, when ftrace_startup_enable fails, we need to rollback registration process and remove ops from ftrace_ops_list. Link: https://lkml.kernel.org/r/20220818032659.56209-1-yangjihong1@huawei.comSuggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-
Steven Rostedt (Google) authored
If in perf_trace_event_init(), the perf_trace_event_open() fails, then it will call perf_trace_event_unreg() which will not only unregister the perf trace event, but will also call the put() function of the tp_event. The problem here is that the trace_event_try_get_ref() is called by the caller of perf_trace_event_init() and if perf_trace_event_init() returns a failure, it will then call trace_event_put(). But since the perf_trace_event_unreg() already called the trace_event_put() function, it triggers a WARN_ON(). WARNING: CPU: 1 PID: 30309 at kernel/trace/trace_dynevent.c:46 trace_event_dyn_put_ref+0x15/0x20 If perf_trace_event_reg() does not call the trace_event_try_get_ref() then the perf_trace_event_unreg() should not be calling trace_event_put(). This breaks symmetry and causes bugs like these. Pull out the trace_event_put() from perf_trace_event_unreg() and call it in the locations that perf_trace_event_unreg() is called. This not only fixes this bug, but also brings back the proper symmetry of the reg/unreg vs get/put logic. Link: https://lore.kernel.org/all/cover.1660347763.git.kjlx@templeofstupid.com/ Link: https://lkml.kernel.org/r/20220816192817.43d5e17f@gandalf.local.home Cc: stable@vger.kernel.org Fixes: 1d18538e ("tracing: Have dynamic events have a ref counter") Reported-by: Krister Johansen <kjlx@templeofstupid.com> Reviewed-by: Krister Johansen <kjlx@templeofstupid.com> Tested-by: Krister Johansen <kjlx@templeofstupid.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-