    perf vendor events amd: Add Zen2 events · 2079f7aa
    Authored by Vijay Thakkar
    This patch adds PMU events for AMD Zen2 core based processors, namely,
    Matisse (model 71h), Castle Peak (model 31h) and Rome (model 2xh), as
    documented in the AMD Processor Programming Reference for Matisse [1].
    The model number regex has been set to detect all models under family
    17h that do not match those of Zen1, as the model range is larger for
    Zen2.
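
    As an illustrative sketch only (not copied verbatim from the patch, and
    assuming perf's usual mapfile.csv column layout of "vendor-family-model
    regex, version, PMU directory, event type"), the model matching described
    above would look roughly like this; the exact version fields and the
    Zen1 pattern may differ:

      AuthenticAMD-23-([12][0-9A-F]|[0-9A-F]),v2,amdzen1,core
      AuthenticAMD-23-[[:xdigit:]]+,v1,amdzen2,core

    The Zen2 pattern is effectively a catch-all for family 17h and, assuming
    the Zen1 entry is consulted first, it only picks up models that the
    narrower Zen1 pattern does not already claim.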
    
    Zen2 adds some additional counters that are not present in Zen1, and
    events for them have been added in this patch. Some counters that were
    previously present in Zen1 have also been removed for Zen2, as they have
    been confirmed to always sample zero on Zen2. These added/removed
    counters are not listed here for brevity, but can be found at:
    https://gist.github.com/thakkarV/5b12ca5fd7488eb2c42e451e40bdd5f3
    
    Note that the PPR for Zen2 [1] does not include some counters that were
    documented in the PPR for Zen1 based processors [2]. These counters were
    tested, and those that still work on Zen2 systems have been preserved in
    the Zen2 events. The counters that are omitted in [1] but are still
    measurable and non-zero on Zen2 (tested on a Ryzen 3900X system) are the
    following (see the JSON sketch after the list):
    
      PMC 0x000 fpu_pipe_assignment.{total|total0|total1|total2|total3}
      PMC 0x004 fp_num_mov_elim_scal_op.*
      PMC 0x046 ls_tablewalker.*
      PMC 0x062 l2_latency.l2_cycles_waiting_on_fills
      PMC 0x063 l2_wcb_req.*
      PMC 0x06D l2_fill_pending.l2_fill_busy
      PMC 0x080 ic_fw32
      PMC 0x081 ic_fw32_miss
      PMC 0x086 bp_snp_re_sync
      PMC 0x087 ic_fetch_stall.*
      PMC 0x08C ic_cache_inval.*
      PMC 0x099 bp_tlb_rel
      PMC 0x0C7 ex_ret_brn_resync
      PMC 0x28A ic_oc_mode_switch.*
      L3PMC 0x001 l3_request_g1.*
      L3PMC 0x006 l3_comb_clstr_state.*
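
    As a sketch of how one of these preserved counters might be encoded in
    the new amdzen2 JSON files (assuming the usual perf pmu-events fields;
    the exact description and unit mask used in the patch may differ), an
    entry for fpu_pipe_assignment.total would look roughly like:

      {
          "EventName": "fpu_pipe_assignment.total",
          "EventCode": "0x00",
          "BriefDescription": "Total number of fp uOps.",
          "UMask": "0xf"
      },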
    
    [1]: Processor Programming Reference (PPR) for AMD Family 17h Model 71h,
    Revision B0 Processors, 56176 Rev 3.06 - Jul 17, 2019
    
    [2]: Processor Programming Reference (PPR) for AMD Family 17h Models
    01h,08h, Revision B2 Processors, 54945 Rev 3.03 - Jun 14, 2019
    
    All of the PPRs can be found at:
    
    https://bugzilla.kernel.org/show_bug.cgi?id=206537
    
    Here are the results of running "fpu_pipe_assignment.total" events on my
    Ryzen 3900X family 17h model 71h system:
    
    Before this patch:
    
      $> perf list *fpu_pipe_assignment*
    
    List of pre-defined events (to be used in -e):
    
    After:
    
      $> perf list *fpu_pipe_assignment*
    
      floating point:
      fpu_pipe_assignment.total
          [Total number of fp uOps]
      fpu_pipe_assignment.total0
          [Total number uOps assigned to pipe 0]
      fpu_pipe_assignment.total1
          [Total number uOps assigned to pipe 1]
      fpu_pipe_assignment.total2
          [Total number uOps assigned to pipe 2]
      fpu_pipe_assignment.total3
          [Total number uOps assigned to pipe 3]
    
      Metric Groups:
    
      $> perf stat -e fpu_pipe_assignment.total sleep 1
    
      Performance counter stats for 'sleep 1':
    
                  25,883      fpu_pipe_assignment.total
    
             1.004145868 seconds time elapsed
    
             0.001805000 seconds user
             0.000000000 seconds sys
    
    Usage tests while running Linpack in the background:
    
      $> perf stat -I1000 -e fpu_pipe_assignment.total
           1.000266796     79,313,191,516      fpu_pipe_assignment.total
           2.000809630     68,091,474,430      fpu_pipe_assignment.total
           3.001028115     52,925,023,174      fpu_pipe_assignment.total
    
      $> perf record -e fpu_pipe_assignment.total,fpu_pipe_assignment.total0 -a sleep 1
      [ perf record: Woken up 9 times to write data ]
      [ perf record: Captured and wrote 4.031 MB perf.data (64764 samples) ]
    
      $> perf report --stdio --no-header | head -30
          98.33%  xhpl             xhpl                          [.] dgemm_kernel
           0.28%  xhpl             xhpl                          [.] dtrsm_kernel_LT
           0.10%  xhpl             [kernel.kallsyms]             [k] entry_SYSCALL_64
           0.08%  xhpl             xhpl                          [.] idamax_k
           0.07%  baloo_file_extr  liblmdb.so                    [.] mdb_mid2l_insert
           0.06%  xhpl             xhpl                          [.] dgemm_itcopy
           0.06%  xhpl             xhpl                          [.] dgemm_oncopy
           0.06%  xhpl             [kernel.kallsyms]             [k] __schedule
           0.06%  xhpl             [kernel.kallsyms]             [k] syscall_trace_enter
           0.06%  xhpl             [kernel.kallsyms]             [k] native_sched_clock
           0.06%  xhpl             [kernel.kallsyms]             [k] pick_next_task_fair
           0.05%  xhpl             xhpl                          [.] blas_thread_server.llvm.15009391670273914865
           0.04%  xhpl             [kernel.kallsyms]             [k] do_syscall_64
           0.04%  xhpl             [kernel.kallsyms]             [k] yield_task_fair
           0.04%  xhpl             libpthread-2.31.so            [.] __pthread_mutex_unlock_usercnt
           0.03%  xhpl             [kernel.kallsyms]             [k] cpuacct_charge
           0.03%  xhpl             [kernel.kallsyms]             [k] syscall_return_via_sysret
           0.03%  xhpl             libc-2.31.so                  [.] __sched_yield
           0.03%  xhpl             [kernel.kallsyms]             [k] __calc_delta
    
      $> perf annotate --stdio2 dgemm_kernel | egrep '^ {0,2}[0-9]+' -B2 -A2
                      sub          $0x60,%rsp
                      mov          %rbx,(%rsp)
        0.00          mov          %rbp,0x8(%rsp)
                      mov          %r12,0x10(%rsp)
        0.00          mov          %r13,0x18(%rsp)
                      mov          %r14,0x20(%rsp)
                      mov          %r15,0x28(%rsp)
      --
                      mov          %rdi,%r13
                      mov          %rsi,0x28(%rsp)
        0.00          mov          %rdx,%r12
                      vmovsd       %xmm0,0x30(%rsp)
                      shl          $0x3,%r10
                      mov          0x28(%rsp),%rax
        0.00          xor          %rdx,%rdx
                      mov          $0x18,%rdi
                      div          %rdi
      --
                      nop
                a0:   mov          %r12,%rax
        0.00          shl          $0x3,%rax
                      mov          %r8,%rdi
                      lea          (%r8,%rax,8),%r15
      --
                      mov          %r12,%rax
                      nop
        0.00    c0:   vmovups      (%rdi),%ymm1
        0.09          vmovups      0x20(%rdi),%ymm2
        0.02          vmovups      (%r15),%ymm3
        0.10          vmovups      %ymm1,(%rsi)
        0.07          vmovups      %ymm2,0x20(%rsi)
        0.07          vmovups      %ymm3,0x40(%rsi)
        0.06          add          $0x40,%rdi
                      add          $0x40,%r15
                      add          $0x60,%rsi
        0.00          dec          %rax
                    ↑ jne          c0
                      mov          %r9,%r15
      --
                      nop
               110:   lea          0x80(%rsp),%rsi
        0.01          add          $0x60,%rsi
        0.03          mov          %r12,%rax
        0.00          sar          $0x3,%rax
                      cmp          $0x2,%rax
                    ↓ jl           d26
                      prefetcht0   0x200(%rdi)
        0.01          vmovups      -0x60(%rsi),%ymm1
        0.02          prefetcht0   0xa0(%rsi)
        0.00          vbroadcastsd -0x80(%rdi),%ymm0
        0.00          prefetcht0   0xe0(%rsi)
        0.03          vmovups      -0x40(%rsi),%ymm2
        0.00          prefetcht0   0x120(%rsi)
                      vmovups      -0x20(%rsi),%ymm3
                      vmulpd       %ymm0,%ymm1,%ymm4
        0.01          prefetcht0   0x160(%rsi)
                      vmulpd       %ymm0,%ymm2,%ymm8
        0.01          vmulpd       %ymm0,%ymm3,%ymm12
        0.02          prefetcht0   0x1a0(%rsi)
        0.01          vbroadcastsd -0x78(%rdi),%ymm0
                      vmulpd       %ymm0,%ymm1,%ymm5
        0.01          vmulpd       %ymm0,%ymm2,%ymm9
                      vmulpd       %ymm0,%ymm3,%ymm13
        0.01          vbroadcastsd -0x70(%rdi),%ymm0
                      vmulpd       %ymm0,%ymm1,%ymm6
        0.00          vmulpd       %ymm0,%ymm2,%ymm10
        0.00          add          $0x60,%rsi
    
      ... snip ...
    
                      nop
              65e0:   vmovddup     -0x60(%rsi),%xmm2
        0.00          vmovups      -0x80(%rdi),%xmm0
                      vmovups      -0x70(%rdi),%xmm1
        0.00          vmovddup     -0x58(%rsi),%xmm3
                      vfmadd231pd  %xmm0,%xmm2,%xmm4
        0.00          vfmadd231pd  %xmm1,%xmm2,%xmm5
        0.00          vfmadd231pd  %xmm0,%xmm3,%xmm6
        0.00          vfmadd231pd  %xmm1,%xmm3,%xmm7
        0.00          add          $0x10,%rsi
                      add          $0x20,%rdi
        0.00          dec          %rax
                    ↑ jne          65e0
                      nop
                      nop
              6620:   vmovddup     0x30(%rsp),%xmm0
        0.00          vmulpd       %xmm0,%xmm4,%xmm4
        0.00          vmulpd       %xmm0,%xmm5,%xmm5
                      vmulpd       %xmm0,%xmm6,%xmm6
                      vmulpd       %xmm0,%xmm7,%xmm7
                      vaddpd       (%r15),%xmm4,%xmm4
                      vaddpd       0x10(%r15),%xmm5,%xmm5
        0.00          vaddpd       (%r15,%r10,1),%xmm6,%xmm6
        0.00          vaddpd       0x10(%r15,%r10,1),%xmm7,%xmm7
        0.00          vmovups      %xmm4,(%r15)
                      vmovups      %xmm5,0x10(%r15)
        0.00          vmovups      %xmm6,(%r15,%r10,1)
                      vmovups      %xmm7,0x10(%r15,%r10,1)
                      add          $0x20,%r15
      --
                      lea          (%r8,%rax,8),%r8
              69d8:   mov          0x20(%rsp),%r14
        0.00          test         $0x1,%r14
                    ↓ je           6d84
                      mov          %r9,%r15
      --
                      vbroadcastsd -0x28(%rsi),%ymm3
                      vfmadd231pd  (%rdi),%ymm0,%ymm4
        0.00          vfmadd231pd  0x20(%rdi),%ymm1,%ymm5
                      vfmadd231pd  0x40(%rdi),%ymm2,%ymm6
                      vfmadd231pd  0x60(%rdi),%ymm3,%ymm7
      --
                      vmulpd       %ymm0,%ymm4,%ymm4
                      vaddpd       (%r15),%ymm4,%ymm4
        0.00          vmovups      %ymm4,(%r15)
                      add          $0x20,%r15
                      dec          %r11
      --
                      mov          %rbx,%rsp
                      mov          (%rsp),%rbx
        0.01          mov          0x8(%rsp),%rbp
                      mov          0x10(%rsp),%r12
                      mov          0x18(%rsp),%r13
    Signed-off-by: Vijay Thakkar <vijaythakkar@me.com>
    Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Acked-by: Kim Phillips <kim.phillips@amd.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Jon Grimm <jon.grimm@amd.com>
    Cc: Martin Liška <mliska@suse.cz>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Link: http://lore.kernel.org/lkml/20200318190002.307290-3-vijaythakkar@me.com
    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>