    perf/hw_breakpoint: Reduce contention with large number of tasks · 0912037f
    Marco Elver authored
    While optimizing task_bp_pinned()'s runtime complexity to O(1) on
    average helps reduce time spent in the critical section, we still pay
    for serializing everything through 'nr_bp_mutex'. Indeed, a profile
    shows that lock contention is now the biggest issue:
    
        95.93%  [kernel]       [k] osq_lock
         0.70%  [kernel]       [k] mutex_spin_on_owner
         0.22%  [kernel]       [k] smp_cfm_core_cond
         0.18%  [kernel]       [k] task_bp_pinned
         0.18%  [kernel]       [k] rhashtable_jhash2
         0.15%  [kernel]       [k] queued_spin_lock_slowpath
    
    when running the breakpoint benchmark (on a system with 256 CPUs) with:
    
     | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
     | # Running 'breakpoint/thread' benchmark:
     | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
     |      Total time: 0.207 [sec]
     |
     |      108.267188 usecs/op
     |     6929.100000 usecs/op/cpu
    
    The main requirement when synchronizing the breakpoint constraints data
    is that every reader observes a consistent snapshot of the per-CPU and
    per-task data.
    
    The access pattern is as follows:
    
     1. If the target is a task: the task's pinned breakpoints are counted,
        checked for space, and then appended to; only bp_cpuinfo::cpu_pinned
        is used to check for conflicts with CPU-only breakpoints;
        bp_cpuinfo::tsk_pinned is incremented/decremented, but otherwise
        unused.
    
     2. If the target is a CPU: bp_cpuinfo::cpu_pinned is counted, along
        with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is
        incremented. No per-task breakpoints are checked. (The data involved
        is sketched right after this list.)
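    For orientation, below is a minimal sketch of the constraints data the
    above access pattern operates on. It is a simplification: the exact
    definitions in kernel/events/hw_breakpoint.c differ (e.g. the histogram
    is sized per breakpoint type and managed through dedicated helpers):

      #include <linux/atomic.h>

      /*
       * Simplified sketch (not the exact kernel definition): one
       * bp_cpuinfo per CPU and breakpoint type.
       */
      struct bp_cpuinfo {
              /* Number of CPU-target (pinned) breakpoints on this CPU. */
              unsigned int    cpu_pinned;
              /*
               * Histogram of task-target breakpoints on this CPU:
               * tsk_pinned[n] is the number of tasks that have n+1
               * breakpoints pinned here. Switched to atomic_t by this
               * patch, since task-target paths update it without ever
               * reading it.
               */
              atomic_t        *tsk_pinned;
      };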
    
    Since rhltable safely synchronizes insertions/deletions, we can allow
    concurrency as follows:
    
     1. If the target is a task: independent tasks may update and check the
        constraints concurrently, but same-task target calls need to be
        serialized; and since bp_cpuinfo::tsk_pinned is only updated, never
        checked, on this path, its modifications can happen concurrently
        once tsk_pinned is switched to atomic_t (see the sketch after this
        list).
    
     2. If the target is a CPU: access to the per-CPU constraints needs to
        be serialized with other CPU-target and task-target callers (to
        stabilize the bp_cpuinfo::tsk_pinned snapshot).
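    To illustrate point 1, here is an illustrative sketch of a concurrent
    tsk_pinned histogram update; the helper name and its arguments are
    hypothetical, and the patch's actual toggle logic differs in detail:

      #include <linux/atomic.h>

      /*
       * Sketch: move a task between tsk_pinned histogram buckets when one
       * of its breakpoints is toggled. @old_count is the number of
       * breakpoints the task had pinned on this CPU before the toggle;
       * @weight is +1 on enable, -1 on disable. Independent tasks can run
       * this concurrently, because this path never reads tsk_pinned to
       * make decisions.
       */
      static void tsk_pinned_toggle(atomic_t *tsk_pinned, int old_count, int weight)
      {
              int old_idx = old_count - 1;    /* bucket the task was in  */
              int new_idx = old_idx + weight; /* bucket the task moves to */

              if (old_idx >= 0)
                      atomic_dec(&tsk_pinned[old_idx]);
              if (new_idx >= 0)
                      atomic_inc(&tsk_pinned[new_idx]);
      }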
    
    We can allow the above concurrency by introducing a reader-writer lock
    for the per-CPU constraints data (bp_cpuinfo_sem), and per-task mutexes
    (reusing task_struct::perf_event_mutex); a sketch of the resulting
    helpers follows the list below:
    
      1. If the target is a task: acquires perf_event_mutex, and acquires
         bp_cpuinfo_sem as a reader. The choice of percpu-rwsem minimizes
         contention in the presence of many read-lock but few write-lock
         acquisitions: we assume many orders of magnitude more task target
         breakpoints creations/destructions than CPU target breakpoints.
    
      2. If the target is a CPU: acquires bp_cpuinfo_sem as a writer.
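    Under these assumptions, a minimal sketch of the lock/unlock helpers;
    the patch's actual bp_constraints_lock()/bp_constraints_unlock() differ
    in detail:

      #include <linux/percpu-rwsem.h>
      #include <linux/perf_event.h>
      #include <linux/sched.h>

      DEFINE_STATIC_PERCPU_RWSEM(bp_cpuinfo_sem);

      /*
       * Lock the constraints data for breakpoint @bp; returns the per-task
       * mutex taken, if any, so the caller can pass it back to unlock.
       */
      static struct mutex *bp_constraints_lock(struct perf_event *bp)
      {
              struct task_struct *tsk = bp->hw.target;

              if (tsk) {
                      /*
                       * Task target: serialize only against same-task
                       * callers; many readers may hold bp_cpuinfo_sem
                       * concurrently.
                       */
                      mutex_lock(&tsk->perf_event_mutex);
                      percpu_down_read(&bp_cpuinfo_sem);
                      return &tsk->perf_event_mutex;
              }

              /* CPU target: exclude all CPU- and task-target callers. */
              percpu_down_write(&bp_cpuinfo_sem);
              return NULL;
      }

      static void bp_constraints_unlock(struct mutex *tsk_mtx)
      {
              if (tsk_mtx) {
                      percpu_up_read(&bp_cpuinfo_sem);
                      mutex_unlock(tsk_mtx);
              } else {
                      percpu_up_write(&bp_cpuinfo_sem);
              }
      }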
    
    With these changes, contention with thousands of tasks is reduced to the
    point where waiting on locking no longer dominates the profile:
    
     | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
     | # Running 'breakpoint/thread' benchmark:
     | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
     |      Total time: 0.077 [sec]
     |
     |       40.201563 usecs/op
     |     2572.900000 usecs/op/cpu
    
        21.54%  [kernel]       [k] task_bp_pinned
        20.18%  [kernel]       [k] rhashtable_jhash2
         6.81%  [kernel]       [k] toggle_bp_slot
         5.47%  [kernel]       [k] queued_spin_lock_slowpath
         3.75%  [kernel]       [k] smp_cfm_core_cond
         3.48%  [kernel]       [k] bcmp
    
    On this particular setup that's a speedup of 2.7x.
    
    This also brings us closer to the theoretical ideal achievable through
    optimizations in hw_breakpoint.c alone, i.e. the performance with
    constraints accounting disabled entirely:
    
     | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
     | # Running 'breakpoint/thread' benchmark:
     | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
     |      Total time: 0.067 [sec]
     |
     |       35.286458 usecs/op
     |     2258.333333 usecs/op/cpu
    
    Which means the current implementation is ~14% slower than the
    theoretical ideal (40.20 vs. 35.29 usecs/op).
    
    For reference, performance without any breakpoints:
    
     | $> perf bench -r 30 breakpoint thread -b 0 -p 64 -t 64
     | # Running 'breakpoint/thread' benchmark:
     | # Created/joined 30 threads with 0 breakpoints and 64 parallelism
     |      Total time: 0.060 [sec]
     |
     |       31.365625 usecs/op
     |     2007.400000 usecs/op/cpu
    
    On a system with 256 CPUs, the theoretical ideal is only ~12% slower
    than no breakpoints at all; the current implementation is ~28% slower.

    Signed-off-by: Marco Elver <elver@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
    Acked-by: Ian Rogers <irogers@google.com>
    Link: https://lore.kernel.org/r/20220829124719.675715-12-elver@google.com