    perf/core: Make cgroup switch visit only cpuctxs with cgroup events
    This patch follows from a conversation in CQM/CMT's last series about
    speeding up the context switch for cgroup events:
    
      https://patchwork.kernel.org/patch/9478617/
    
    This is a low-hanging fruit optimization. It replaces the iteration over
    the "pmus" list in the cgroup switch path with an iteration over a new
    list that contains only cpuctxs with at least one cgroup event.
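    
    In outline: each CPU keeps a list of the cpuctxs that currently have at
    least one cgroup event, and the context-switch path walks only that list.
    The sketch below is illustrative rather than a verbatim copy of the
    patch: the names cgrp_cpuctx_list, cgrp_cpuctx_entry and
    list_update_cgroup_event, and the abridged function bodies, are
    assumptions. It presumes struct perf_cpu_context gains a list_head
    member (cgrp_cpuctx_entry) for the linkage:
    
      /* Per-CPU list of the cpuctxs that have at least one cgroup event. */
      static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
    
      static void perf_cgroup_switch(struct task_struct *task, int mode)
      {
              struct perf_cpu_context *cpuctx;
              struct list_head *list;
              unsigned long flags;
    
              /* Keep the list and ctx->nr_cgroups stable while walking. */
              local_irq_save(flags);
    
              list = this_cpu_ptr(&cgrp_cpuctx_list);
              list_for_each_entry(cpuctx, list, cgrp_cpuctx_entry) {
                      /* Only cpuctxs with cgroup events are on this list. */
                      WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
    
                      /* ... use 'mode' to schedule the cgroup events
                       * out and/or in, as before ... */
              }
    
              local_irq_restore(flags);
      }
    
      /*
       * Keep the per-CPU list in sync: link a cpuctx when its first
       * cgroup event arrives, unlink it when its last one goes away.
       */
      static void list_update_cgroup_event(struct perf_event *event,
                                           struct perf_event_context *ctx,
                                           bool add)
      {
              struct perf_cpu_context *cpuctx;
    
              if (!is_cgroup_event(event))
                      return;
    
              cpuctx = __get_cpu_context(ctx);
              if (add && ctx->nr_cgroups++ == 0)
                      list_add(&cpuctx->cgrp_cpuctx_entry,
                               this_cpu_ptr(&cgrp_cpuctx_list));
              else if (!add && --ctx->nr_cgroups == 0)
                      list_del(&cpuctx->cgrp_cpuctx_entry);
      }
    
    The list is touched only when a context's cgroup-event count crosses
    zero, so its maintenance cost is negligible next to the per-switch
    saving.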
    
    This is necessary because the number of PMUs has increased over the years,
    e.g. modern x86 server systems have well above 50 PMUs.
    
    The iteration over the full PMU list is unnecessary and can be costly in
    heavy cache contention scenarios.
    
    Below are some instrumentation measurements, with the 10th, 50th and 90th
    percentiles of the total cost of context switch before and after this
    optimization, for a simple array read/write microbenchmark.
    
      Contention
        Level    Nr events      Before (us)            After (us)       Median
      L2    L3     types      (10%, 50%, 90%)       (10%, 50%, 90%)    Speedup
      --------------------------------------------------------------------------
      Low   Low       1       (1.72, 2.42, 5.85)    (1.35, 1.64, 5.46)     29%
      High  Low       1       (2.08, 4.56, 19.8)    (1.72, 2.20, 13.7)     51%
      High  High      1       (2.86, 10.4, 12.7)    (2.54, 4.32, 12.1)     58%
    
      Low   Low       2       (1.98, 3.20, 6.89)    (1.68, 2.41, 8.89)     24%
      High  Low       2       (2.48, 5.28, 22.4)    (2.15, 3.69, 14.6)     30%
      High  High      2       (3.32, 8.09, 13.9)    (2.80, 5.15, 13.7)     36%
    
    where:
    
      1 event type  = cycles
      2 event types = cycles,intel_cqm/llc_occupancy/
    
      Contention L2 Low:  workset of task  <  L2 cache size.
                    High: workset of task >>  L2 cache size.
      Contention L3 Low:  workset of tasks on all sockets  <  L3 cache size.
                    High: workset of tasks on all sockets >>  L3 cache size.
    
      Median Speedup is (50%ile Before - 50%ile After) / 50%ile Before
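    
      For example, for the (High L2, Low L3) row with one event type:
      (4.56 - 2.20) / 4.56 ~= 51%.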
    
    Unsurprisingly, the benefits of this optimization decrease with the number
    of cpuctxs that have cgroup events, yet it is never detrimental.
    Tested-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Mark Rutland <mark.rutland@arm.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Borislav Petkov <bp@suse.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Kan Liang <kan.liang@intel.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
    Cc: Stephane Eranian <eranian@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
    Cc: Vince Weaver <vince@deater.net>
    Cc: Vince Weaver <vincent.weaver@maine.edu>
    Link: http://lkml.kernel.org/r/20170118192454.58008-2-davidcc@google.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>