1. 30 Apr, 2020 3 commits
    • perf synthetic events: Remove use of sscanf from /proc reading · 2069425e
      Ian Rogers authored
      The synthesize benchmark, run on a single process and thread, shows
      perf_event__synthesize_mmap_events as the hottest function with fgets
      and sscanf taking the majority of execution time.
      
      fscanf performs similarly well, so switching to it would not help. Instead,
      replace the sscanf call with manual reading of each field of the
      /proc/pid/maps line, and remove some unnecessary buffering.
      
      This change also addresses potential, but unlikely, buffer overruns for
      the string values read by scanf.
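
      Purely as an illustration (not the code in this patch), a field-by-field
      parse of one /proc/pid/maps line in C without sscanf, with every string
      copy bounded; the helper name and parameters here are hypothetical:

        #include <stdlib.h>

        /*
         * Illustrative only: parse "start-end perms offset dev inode path"
         * from one /proc/<pid>/maps line without sscanf; every string copy
         * is bounded by the caller supplied buffer sizes.
         */
        static int parse_maps_line(const char *line,
                                   unsigned long long *start,
                                   unsigned long long *end,
                                   char *prot, size_t prot_sz,
                                   char *path, size_t path_sz)
        {
                char *p;
                size_t i;
                int field;

                *start = strtoull(line, &p, 16);        /* start address */
                if (*p++ != '-')
                        return -1;
                *end = strtoull(p, &p, 16);             /* end address */
                if (*p++ != ' ')
                        return -1;

                for (i = 0; *p && *p != ' ' && i + 1 < prot_sz; p++)
                        prot[i++] = *p;                 /* e.g. "r-xp" */
                prot[i] = '\0';

                for (field = 0; field < 3; field++) {   /* skip offset, dev, inode */
                        while (*p == ' ')
                                p++;
                        while (*p && *p != ' ')
                                p++;
                }
                while (*p == ' ')
                        p++;

                for (i = 0; *p && *p != '\n' && i + 1 < path_sz; p++)
                        path[i++] = *p;                 /* pathname, may be empty */
                path[i] = '\0';
                return 0;
        }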
      
      Performance before is:
      
        $ sudo perf bench internals synthesize -m 16 -M 16 -s -t
        # Running 'internals/synthesize' benchmark:
        Computing performance of single threaded perf event synthesis by
        synthesizing events on the perf process itself:
          Average synthesis took: 102.810 usec (+- 0.027 usec)
          Average num. events: 17.000 (+- 0.000)
          Average time per event 6.048 usec
          Average data synthesis took: 106.325 usec (+- 0.018 usec)
          Average num. events: 89.000 (+- 0.000)
          Average time per event 1.195 usec
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
          Number of synthesis threads: 16
            Average synthesis took: 68103.100 usec (+- 441.234 usec)
            Average num. events: 30703.000 (+- 0.730)
            Average time per event 2.218 usec
      
      And after is:
      
        $ sudo perf bench internals synthesize -m 16 -M 16 -s -t
        # Running 'internals/synthesize' benchmark:
        Computing performance of single threaded perf event synthesis by
        synthesizing events on the perf process itself:
          Average synthesis took: 50.388 usec (+- 0.031 usec)
          Average num. events: 17.000 (+- 0.000)
          Average time per event 2.964 usec
          Average data synthesis took: 52.693 usec (+- 0.020 usec)
          Average num. events: 89.000 (+- 0.000)
          Average time per event 0.592 usec
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
          Number of synthesis threads: 16
            Average synthesis took: 45022.400 usec (+- 552.740 usec)
            Average num. events: 30624.200 (+- 10.037)
            Average time per event 1.470 usec
      
      On an Intel Xeon 6154, compiling with Debian gcc 9.2.1.
      
      Committer testing:
      
      On an AMD Ryzen 5 3600X 6-Core Processor:
      
      Before:
      
        # perf bench internals synthesize --min-threads 12 --max-threads 12 --st --mt
        # Running 'internals/synthesize' benchmark:
        Computing performance of single threaded perf event synthesis by
        synthesizing events on the perf process itself:
          Average synthesis took: 267.491 usec (+- 0.176 usec)
          Average num. events: 56.000 (+- 0.000)
          Average time per event 4.777 usec
          Average data synthesis took: 277.257 usec (+- 0.169 usec)
          Average num. events: 287.000 (+- 0.000)
          Average time per event 0.966 usec
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
          Number of synthesis threads: 12
            Average synthesis took: 81599.500 usec (+- 346.315 usec)
            Average num. events: 36096.100 (+- 2.523)
            Average time per event 2.261 usec
        #
      
      After:
      
        # perf bench internals synthesize --min-threads 12 --max-threads 12 --st --mt
        # Running 'internals/synthesize' benchmark:
        Computing performance of single threaded perf event synthesis by
        synthesizing events on the perf process itself:
          Average synthesis took: 110.125 usec (+- 0.080 usec)
          Average num. events: 56.000 (+- 0.000)
          Average time per event 1.967 usec
          Average data synthesis took: 118.518 usec (+- 0.057 usec)
          Average num. events: 287.000 (+- 0.000)
          Average time per event 0.413 usec
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
          Number of synthesis threads: 12
            Average synthesis took: 43490.700 usec (+- 284.527 usec)
            Average num. events: 37028.500 (+- 0.563)
            Average time per event 1.175 usec
        #
      Signed-off-by: Ian Rogers <irogers@google.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrey Zhizhikin <andrey.z@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lore.kernel.org/lkml/20200415054050.31645-4-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • tools api: Add a lightweight buffered reading api · e95770af
      Ian Rogers authored
      The synthesize benchmark shows the majority of execution time going to
      fgets and sscanf, necessary to parse /proc/pid/maps. Add a new buffered
      reading library that will be used to replace these calls in a follow-up
      change. Add tests for the library to perf test.
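
      For illustration, a minimal sketch of the kind of lightweight buffered
      reader this adds; the struct layout and function names below are
      illustrative, not necessarily the exact tools/lib/api/io.h interface:

        #include <unistd.h>

        /*
         * Illustrative sketch of a lightweight buffered reader: a caller
         * supplied buffer refilled with read(2), avoiding stdio entirely.
         */
        struct io {
                int fd;                 /* file being read */
                char *buf;              /* caller supplied buffer */
                char *data;             /* next byte to return */
                char *end;              /* one past the last valid byte */
                size_t buf_len;
                int eof;
        };

        static void io__init(struct io *io, int fd, char *buf, size_t buf_len)
        {
                io->fd = fd;
                io->buf = buf;
                io->data = io->end = buf;       /* buffer starts out empty */
                io->buf_len = buf_len;
                io->eof = 0;
        }

        /* Return the next character, or -1 on end of file or error. */
        static int io__get_char(struct io *io)
        {
                if (io->data == io->end) {
                        ssize_t n = read(io->fd, io->buf, io->buf_len);

                        if (n <= 0) {
                                io->eof = 1;
                                return -1;
                        }
                        io->data = io->buf;
                        io->end = io->buf + n;
                }
                return (unsigned char)*io->data++;
        }

      Helpers that read hex or decimal numbers can then be layered on top of
      the character reader.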
      
      Committer tests:
      
        $ perf test api
        63: Test api io                                           : Ok
        $
      Signed-off-by: Ian Rogers <irogers@google.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrey Zhizhikin <andrey.z@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lore.kernel.org/lkml/20200415054050.31645-3-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf bench: Add a multi-threaded synthesize benchmark · 13edc237
      Ian Rogers authored
      By default the multi-threaded benchmark isn't run, as it reads /proc and
      may not have access. For consistency, modify the single threaded
      benchmark to also compute an average time per event.
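
      The per event figure is simply the average synthesis time divided by the
      average number of events; for instance, for the single-thread data point
      of the multi-threaded run shown below:

        t_{event} = \frac{65449.000\,\mu s}{9405.400\ \mathrm{events}} \approx 6.959\,\mu s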
      
      Committer testing:
      
        $ grep -m1 "model name" /proc/cpuinfo
        model name	: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
        $ grep "model name" /proc/cpuinfo  | wc -l
        8
        $
        $ perf bench internals synthesize -h
        # Running 'internals/synthesize' benchmark:
      
         Usage: perf bench internals synthesize <options>
      
            -I, --multi-iterations <n>
                                  Number of iterations used to compute multi-threaded average
            -i, --single-iterations <n>
                                  Number of iterations used to compute single-threaded average
            -M, --max-threads <n>
                                  Maximum number of threads in multithreaded bench
            -m, --min-threads <n>
                                  Minimum number of threads in multithreaded bench
            -s, --st              Run single threaded benchmark
            -t, --mt              Run multi-threaded benchmark
      
        $
        $ perf bench internals synthesize -t
        # Running 'internals/synthesize' benchmark:
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
          Number of synthesis threads: 1
            Average synthesis took: 65449.000 usec (+- 586.442 usec)
            Average num. events: 9405.400 (+- 0.306)
            Average time per event 6.959 usec
          Number of synthesis threads: 2
            Average synthesis took: 37838.300 usec (+- 130.259 usec)
            Average num. events: 9501.800 (+- 20.469)
            Average time per event 3.982 usec
          Number of synthesis threads: 3
            Average synthesis took: 48551.400 usec (+- 225.686 usec)
            Average num. events: 9544.000 (+- 0.000)
            Average time per event 5.087 usec
          Number of synthesis threads: 4
            Average synthesis took: 29632.500 usec (+- 50.808 usec)
            Average num. events: 9544.000 (+- 0.000)
            Average time per event 3.105 usec
          Number of synthesis threads: 5
            Average synthesis took: 33920.400 usec (+- 284.509 usec)
            Average num. events: 9544.000 (+- 0.000)
            Average time per event 3.554 usec
          Number of synthesis threads: 6
            Average synthesis took: 27604.100 usec (+- 72.344 usec)
            Average num. events: 9548.000 (+- 0.000)
            Average time per event 2.891 usec
          Number of synthesis threads: 7
            Average synthesis took: 25406.300 usec (+- 933.371 usec)
            Average num. events: 9545.500 (+- 0.167)
            Average time per event 2.662 usec
          Number of synthesis threads: 8
            Average synthesis took: 24110.400 usec (+- 73.229 usec)
            Average num. events: 9551.000 (+- 0.000)
            Average time per event 2.524 usec
        $
      Signed-off-by: Ian Rogers <irogers@google.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andrey Zhizhikin <andrey.z@gmail.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lore.kernel.org/lkml/20200415054050.31645-2-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  2. 23 Apr, 2020 3 commits
    • perf record: Add num-synthesize-threads option · d99c22ea
      Stephane Eranian authored
      Add this option to control the degree of parallelism of the
      synthesize_mmap() code, which scans /proc/PID/task/PID/maps and can be
      time consuming. The option is handled the same way as in 'perf top'.
      If not specified, it defaults to 1 thread, i.e. the behavior before
      this option existed.
      
      On a desktop computer the processing of /proc/PID/task/PID/maps isn't
      slow enough to warrant parallel processing, and thread creation has some
      cost, hence the default of 1. On a loaded server with more than 100
      cores, synthesis times on the order of seconds are possible, and in that
      case having the option is desirable.
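
      For example, to use 8 synthesis threads when recording system wide on a
      busy machine (assuming the option spelling matches the 'perf top' option
      being mimicked, --num-thread-synthesize):

        perf record --num-thread-synthesize 8 -a sleep 10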
      
      As the processing is a synchronization point, it is legitimate to ask
      whether Amdahl's law will limit this patch. Profiling with this patch in
      place:
      https://lore.kernel.org/lkml/20200415054050.31645-4-irogers@google.com/
      shows:
      ...
            - 32.59% __perf_event__synthesize_threads
               - 32.54% __event__synthesize_thread
                  + 22.13% perf_event__synthesize_mmap_events
                  + 6.68% perf_event__get_comm_ids.constprop.0
                  + 1.49% process_synthesized_event
                  + 1.29% __GI___readdir64
                  + 0.60% __opendir
      ...
      That is, the serialized processing (process_synthesized_event) accounts
      for only 1.49% of execution time, leaving plenty of work to run in
      parallel. This is shown by the benchmark from earlier in this series:
      
      https://lore.kernel.org/lkml/20200415054050.31645-2-irogers@google.com/
      
        Computing performance of multi threaded perf event synthesis by
        synthesizing events on CPU 0:
         Number of synthesis threads: 1
           Average synthesis took: 127729.000 usec (+- 3372.880 usec)
           Average num. events: 21548.600 (+- 0.306)
           Average time per event 5.927 usec
         Number of synthesis threads: 2
           Average synthesis took: 88863.500 usec (+- 385.168 usec)
           Average num. events: 21552.800 (+- 0.327)
           Average time per event 4.123 usec
         Number of synthesis threads: 3
           Average synthesis took: 83257.400 usec (+- 348.617 usec)
           Average num. events: 21553.200 (+- 0.327)
           Average time per event 3.863 usec
         Number of synthesis threads: 4
           Average synthesis took: 75093.000 usec (+- 422.978 usec)
           Average num. events: 21554.200 (+- 0.200)
           Average time per event 3.484 usec
         Number of synthesis threads: 5
           Average synthesis took: 64896.600 usec (+- 353.348 usec)
           Average num. events: 21558.000 (+- 0.000)
           Average time per event 3.010 usec
         Number of synthesis threads: 6
           Average synthesis took: 59210.200 usec (+- 342.890 usec)
           Average num. events: 21560.000 (+- 0.000)
           Average time per event 2.746 usec
         Number of synthesis threads: 7
           Average synthesis took: 54093.900 usec (+- 306.247 usec)
           Average num. events: 21562.000 (+- 0.000)
           Average time per event 2.509 usec
         Number of synthesis threads: 8
           Average synthesis took: 48938.700 usec (+- 341.732 usec)
           Average num. events: 21564.000 (+- 0.000)
           Average time per event 2.269 usec
      
      The average time per synthesized event goes from 5.927 usec with 1
      thread to 2.269 usec with 8. This isn't a linear speed up, as not all of
      the synthesis code has been made parallel. If the synthesis time were
      about 10 seconds, using 8 threads may bring it down to less than 4
      seconds.
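
      As a rough Amdahl's law sanity check on those numbers (an estimate, not
      a measurement from the patch), the 8 thread speedup and the implied
      parallel fraction p are

        S = \frac{5.927}{2.269} \approx 2.61,
        \qquad
        p = \frac{1 - 1/S}{1 - 1/n} \approx \frac{1 - 0.383}{1 - 0.125} \approx 0.71 \quad (n = 8)

      i.e. roughly 70% of the work is parallelized, consistent with the remark
      above that not all of the synthesis code runs in parallel.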
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Reviewed-by: Ian Rogers <irogers@google.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tony Jones <tonyj@suse.de>
      Cc: yuzhoujian <yuzhoujian@didichuxing.com>
      Link: http://lore.kernel.org/lkml/20200422155038.9380-1-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf test session topology: Fix data path · dbd660e6
      Tommi Rantala authored
      Commit 2d4f2799 ("perf data: Add global path holder") missed the path
      conversion in tests/topology.c, causing the "Session topology" testcase
      to "hang" (it waits forever for input from stdin) when doing
      "ssh $VM perf test".
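
      A sketch of the kind of conversion that was missed (the struct perf_data
      members shown are those used elsewhere in perf at the time; illustrative
      only, not the literal diff):

        struct perf_data data = {
                .path = path,                   /* new global path holder,            */
                .mode = PERF_DATA_MODE_READ,    /* replacing .file = { .path = path } */
        };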
      
      The hang can be reproduced by running "cat | perf test topo"; replacing
      cat with true turns it into a crash:
      
        $ true | perf test -v topo
        40: Session topology                                      :
        --- start ---
        test child forked, pid 3638
        templ file: /tmp/perf-test-QPvAch
        incompatible file format
        incompatible file format (rerun with -v to learn more)
        free(): invalid pointer
        test child interrupted
        ---- end ----
        Session topology: FAILED!
      
      Committer testing:
      
      Reproduced the above result before the patch and after it is back
      working:
      
        # true | perf test -v topo
        41: Session topology                                      :
        --- start ---
        test child forked, pid 19374
        templ file: /tmp/perf-test-YOTEQg
        CPU 0, core 0, socket 0
        CPU 1, core 1, socket 0
        CPU 2, core 2, socket 0
        CPU 3, core 3, socket 0
        CPU 4, core 0, socket 0
        CPU 5, core 1, socket 0
        CPU 6, core 2, socket 0
        CPU 7, core 3, socket 0
        test child finished with 0
        ---- end ----
        Session topology: Ok
        #
      
      Fixes: 2d4f2799 ("perf data: Add global path holder")
      Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Link: http://lore.kernel.org/lkml/20200423115341.562782-1-tommi.t.rantala@nokia.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf stat: Improve runtime stat for interval mode · 197ba86f
      Jin Yao authored
      In interval mode, the metric is printed after the '#' character if it
      exists, but it is not calculated from the counts generated in that
      interval.
      
      See the following examples:
      
        root@kbl-ppc:~# perf stat -M CPI -I1000 --interval-count 2
        #           time             counts unit events
             1.000422803            764,809      inst_retired.any          #      2.9 CPI
             1.000422803          2,234,932      cycles
             2.001464585          1,960,061      inst_retired.any          #      1.6 CPI
             2.001464585          4,022,591      cycles
      
      The second CPI should not be 1.6 (4,022,591/1,960,061 is 2.1).
      
        root@kbl-ppc:~# perf stat -e cycles,instructions -I1000 --interval-count 2
        #           time             counts unit events
             1.000429493          2,869,311      cycles
             1.000429493            816,875      instructions              #    0.28  insn per cycle
             2.001516426          9,260,973      cycles
             2.001516426          5,250,634      instructions              #    0.87  insn per cycle
      
      The second 'insn per cycle' should not be 0.87 (5,250,634/9,260,973 is
      0.57).
      
      The current code uses the global variable 'rt_stat' to track and update
      the std dev of the runtime stat. Unlike the counts, 'rt_stat' is not
      reset per interval, while the counts are:
      
        perf_stat_process_counter()
        {
                if (config->interval)
                        init_stats(ps->res_stats);
        }
      
      So for interval mode, the 'rt_stat' variable should be reset too.
      
      This patch resets 'rt_stat' before read_counters(), so the runtime stat
      is calculated only from the counts generated in this interval.
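
      The shape of the fix, as a hedged sketch (the reset helper name below is
      assumed for illustration, not quoted from the patch):

        /*
         * In the interval path, drop the accumulated runtime/shadow stats
         * before reading the new counts, so the printed metrics see only
         * this interval's deltas:
         */
        perf_stat__reset_shadow_per_stat(&rt_stat);
        read_counters(&rs);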
      
      With this patch:
      
        root@kbl-ppc:~# perf stat -M CPI -I1000 --interval-count 2
        #           time             counts unit events
             1.000420924          2,408,818      inst_retired.any          #      2.1 CPI
             1.000420924          5,010,111      cycles
             2.001448579          2,798,407      inst_retired.any          #      1.6 CPI
             2.001448579          4,599,861      cycles
      
        root@kbl-ppc:~# perf stat -e cycles,instructions -I1000 --interval-count 2
        #           time             counts unit events
             1.000428555          2,769,714      cycles
             1.000428555            774,462      instructions              #    0.28  insn per cycle
             2.001471562          3,595,904      cycles
             2.001471562          1,243,703      instructions              #    0.35  insn per cycle
      
      Now the second 'insn per cycle' and CPI values are calculated from the
      counts generated in each interval.
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Acked-by: Jiri Olsa <jolsa@redhat.com>
      Tested-by: Kajol Jain <kjain@linux.ibm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jin Yao <yao.jin@intel.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/20200420145417.6864-1-yao.jin@linux.intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  3. 22 Apr, 2020 6 commits
  4. 21 Apr, 2020 22 commits
  5. 20 Apr, 2020 6 commits