    perf/core: Set cgroup in CPU contexts for new cgroup events · db4a8356
    David Carrillo-Cisneros authored
    There's a perf stat bug that is easy to observe on a machine with only one cgroup:
    
      $ perf stat -e cycles -I 1000 -C 0 -G /
      #          time             counts unit events
          1.000161699      <not counted>      cycles                    /
          2.000355591      <not counted>      cycles                    /
          3.000565154      <not counted>      cycles                    /
          4.000951350      <not counted>      cycles                    /
    
    We'd expect actual cycle counts there instead of <not counted>.
    
    The underlying problem is that there is an optimization in
    perf_cgroup_sched_{in,out}() that skips the switch of cgroup events
    if the old and new cgroups in a task switch are the same.
    
    This optimization interacts with the current code in two ways
    that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a
    cgroup event matches the current task. These are:
    
      1. On creation of the first cgroup event in a CPU: In current code,
      cpuctx->cgrp is only set in perf_cgroup_sched_in, but due to the
      aforesaid optimization, perf_cgroup_sched_in will not run until the
      next cgroup switch on that CPU. This may happen late or never happen,
      depending on the system's number of cgroups, CPU load, etc.
    
      2. On deletion of the last cgroup event in a cpuctx: In list_del_event,
      cpuctx->cgrp is set to NULL. Any new cgroup event will not be scheduled
      in because cpuctx->cgrp == NULL until a cgroup switch occurs and
      perf_cgroup_sched_in is executed (updating cpuctx->cgrp).
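
    The failure mode described in the two cases above can be illustrated with
    a small user-space model. This is only a sketch, not kernel code: the
    struct and function names (cgroup_model, cpuctx_model, model_sched_switch,
    model_add_event_unfixed, model_event_counts) are hypothetical, and the
    cgroup match is reduced to pointer equality for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; these are NOT the kernel's definitions. */
struct cgroup_model { int id; };

struct cpuctx_model {
    struct cgroup_model *cgrp;   /* plays the role of cpuctx->cgrp */
    int nr_cgroup_events;        /* cgroup events attached to this CPU */
};

/* Models the optimization: a switch between tasks of the same cgroup
 * is skipped, so the context's cgroup pointer is left untouched. */
static void model_sched_switch(struct cpuctx_model *ctx,
                               struct cgroup_model *prev,
                               struct cgroup_model *next)
{
    if (prev == next)
        return;
    ctx->cgrp = next;            /* only a real cgroup switch updates it */
}

/* Pre-fix add path (assumed shape): the event is accounted for,
 * but the context's cgroup pointer is not set. */
static void model_add_event_unfixed(struct cpuctx_model *ctx)
{
    ctx->nr_cgroup_events++;
}

/* Stand-in for the filter check: an event can only count when the
 * context's cgroup pointer is set and matches the event's cgroup. */
static bool model_event_counts(const struct cpuctx_model *ctx,
                               const struct cgroup_model *event_cgrp)
{
    return ctx->cgrp != NULL && ctx->cgrp == event_cgrp;
}
```

    In this model, adding an event on a machine with a single cgroup and then
    switching between tasks of that same cgroup leaves cgrp at NULL, so the
    event never counts, which corresponds to the <not counted> output above.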
    
    This patch fixes both problems by setting cpuctx->cgrp in list_add_event,
    mirroring what list_del_event does when removing a cgroup event from CPU
    context, as introduced in:
    
      commit 68cacd29 ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()")
    
    With this patch, cpuctx->cgrp is always set/cleared when installing/removing
    the first/last cgroup event in/from the CPU context. With cpuctx->cgrp
    correctly set, event_filter_match works as intended when events are
    scheduled in/out.
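
    The set/clear symmetry can be sketched in the same spirit. Again this is a
    user-space model under stated assumptions, not the actual patch: the names
    below are hypothetical, and the sketch assumes that the pointer stored on
    the first add is the cgroup of the task currently running on that CPU.

```c
#include <stddef.h>

/* Simplified stand-ins; these are NOT the kernel's definitions. */
struct cgroup_model { int id; };

struct cpuctx_model {
    struct cgroup_model *cgrp;   /* plays the role of cpuctx->cgrp */
    int nr_cgroup_events;        /* cgroup events attached to this CPU */
};

/* Add path with the fix: installing the first cgroup event sets the
 * context's cgroup pointer right away, without waiting for a switch.
 * Assumption: current_cgrp is the running task's cgroup. */
static void model_list_add_event(struct cpuctx_model *ctx,
                                 struct cgroup_model *current_cgrp)
{
    if (ctx->nr_cgroup_events++ == 0)
        ctx->cgrp = current_cgrp;
}

/* Remove path: removing the last cgroup event clears the pointer,
 * mirroring the add path. */
static void model_list_del_event(struct cpuctx_model *ctx)
{
    if (--ctx->nr_cgroup_events == 0)
        ctx->cgrp = NULL;
}
```

    With both paths in place, the pointer in this model is non-NULL exactly
    while at least one cgroup event is installed, including after the last
    event has been removed and a new one is added.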
    
    After the fix, the output is as expected:
    
      $ perf stat -e cycles -I 1000 -a -G /
      #         time             counts unit events
         1.004699159          627342882      cycles                    /
         2.007397156          615272690      cycles                    /
         3.010019057          616726074      cycles                    /

    Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Kan Liang <kan.liang@intel.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephane Eranian <eranian@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vegard Nossum <vegard.nossum@gmail.com>
    Cc: Vince Weaver <vincent.weaver@maine.edu>
    Link: http://lkml.kernel.org/r/1470124092-113192-1-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf_event.h 38.1 KB