    perf/hw_breakpoint: Optimize list of per-task breakpoints · 0370dc31
    On a machine with 256 CPUs, running the recently added perf breakpoint
    benchmark results in:
    
     | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
     | # Running 'breakpoint/thread' benchmark:
     | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
     |      Total time: 236.418 [sec]
     |
     |   123134.794271 usecs/op
     |  7880626.833333 usecs/op/cpu
    
    The benchmark tests inherited breakpoint perf events across many
    threads.
    
    Looking at a perf profile, we can see that the majority of the time is
    spent in various hw_breakpoint.c functions, which execute within the
    'nr_bp_mutex' critical sections; this in turn also results in
    contention on that mutex:
    
        37.27%  [kernel]       [k] osq_lock
        34.92%  [kernel]       [k] mutex_spin_on_owner
        12.15%  [kernel]       [k] toggle_bp_slot
        11.90%  [kernel]       [k] __reserve_bp_slot
    
    The culprit here is task_bp_pinned(), which has a runtime complexity of
    O(#tasks) due to storing all task breakpoints in the same list and
    iterating through that list looking for a matching task. Clearly, this
    does not scale to thousands of tasks.
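
    For reference, the pre-change pattern is roughly the following
    (a simplified sketch, not the verbatim code: find_slot_idx() and
    hw_breakpoint_weight() are the pre-existing helpers in this file, and
    the per-CPU corner cases are abridged). All per-task breakpoints hang
    off one global list, so counting the slots pinned by a single task
    walks every task's breakpoints:

        /* Simplified sketch of the pre-change data structure. */
        static LIST_HEAD(bp_task_head);         /* all per-task breakpoints */

        static int task_bp_pinned(int cpu, struct perf_event *bp,
                                  enum bp_type_idx type)
        {
                struct perf_event *iter;
                int count = 0;

                /* O(#tasks): every task's breakpoints are on the same list. */
                list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
                        if (iter->hw.target == bp->hw.target &&
                            find_slot_idx(iter->attr.bp_type) == type &&
                            (iter->cpu < 0 || cpu == iter->cpu))
                                count += hw_breakpoint_weight(iter);
                }

                return count;
        }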
    
    Instead, make use of the "rhashtable" variant "rhltable" which stores
    multiple items with the same key in a list. This results in average
    runtime complexity of O(1) for task_bp_pinned().
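
    The shape of the change is roughly the following sketch (not the
    verbatim patch: it assumes hw_perf_event::bp_list becomes a
    struct rhlist_head, and it again abridges the per-CPU corner cases).
    Breakpoints are keyed by their target task, so a lookup only walks
    that one task's entries:

        /* Sketch: rhltable keyed by the breakpoint's target task. */
        static struct rhltable task_bps_ht;

        static const struct rhashtable_params task_bps_ht_params = {
                .head_offset = offsetof(struct hw_perf_event, bp_list),
                .key_offset = offsetof(struct hw_perf_event, target),
                .key_len = sizeof_field(struct hw_perf_event, target),
                .automatic_shrinking = true,
        };

        static int task_bp_pinned(int cpu, struct perf_event *bp,
                                  enum bp_type_idx type)
        {
                struct rhlist_head *head, *pos;
                struct perf_event *iter;
                int count = 0;

                rcu_read_lock();
                /* Average O(1): only this task's breakpoints are visited. */
                head = rhltable_lookup(&task_bps_ht, &bp->hw.target,
                                       task_bps_ht_params);
                if (!head)
                        goto out;

                rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
                        if (find_slot_idx(iter->attr.bp_type) == type &&
                            (iter->cpu < 0 || cpu == iter->cpu))
                                count += hw_breakpoint_weight(iter);
                }
        out:
                rcu_read_unlock();
                return count;
        }

    Breakpoints are then added to and removed from the hash table with
    rhltable_insert()/rhltable_remove(), roughly where toggle_bp_slot()
    previously did the list insertion/removal.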
    
    With the optimization, the benchmark shows:
    
     | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
     | # Running 'breakpoint/thread' benchmark:
     | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
     |      Total time: 0.208 [sec]
     |
     |      108.422396 usecs/op
     |     6939.033333 usecs/op/cpu
    
    On this particular setup that's a speedup of ~1135x (123134.794271 vs.
    108.422396 usecs/op).
    
    While one option would be to make task_struct a breakpoint list node,
    this would only further bloat task_struct for infrequently used data.
    Furthermore, after all optimizations in this series, there's no evidence
    it would result in better performance: later optimizations make the time
    spent looking up entries in the hash table negligible (we'll reach the
    theoretical ideal performance i.e. no constraints).
    
    Signed-off-by: Marco Elver <elver@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
    Acked-by: Ian Rogers <irogers@google.com>
    Link: https://lore.kernel.org/r/20220829124719.675715-5-elver@google.com